Two related fixes for a stuck-open 'Monitoring' entry in the incident header:
1. serviceIsOk(GENERAL) is now called when a monitoring cycle completes
successfully. Previously GENERAL could only accumulate failures (via the
outer Throwable catch), with no complementary recovery, so once the
catch-all fired the service stayed red forever.
2. checkEdqs() is now wrapped in its own try/catch that reports any
non-ServiceFailureException failures under EDQS rather than GENERAL.
Connection/read timeouts hitting /api/entitiesQuery/find previously
propagated unwrapped and were bucketed as GENERAL, which hid the fact
that EDQS was the failing component.
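A minimal sketch of the two fixes above, using simplified stand-ins for the real classes. ServiceKey, the failures map, serviceFailure() and runCycle() are hypothetical names introduced for illustration. Only serviceIsOk(GENERAL), checkEdqs() and ServiceFailureException are named in this change, and their real signatures may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the monitoring service.
class MonitoringCycleSketch {

    enum ServiceKey { GENERAL, EDQS }

    static class ServiceFailureException extends RuntimeException {
        final ServiceKey key;
        ServiceFailureException(ServiceKey key, String message) {
            super(message);
            this.key = key;
        }
    }

    // Failure counters per service; an absent key means the service is healthy.
    final Map<ServiceKey, Integer> failures = new LinkedHashMap<>();

    void serviceFailure(ServiceKey key, Exception e) {
        failures.merge(key, 1, Integer::sum);
    }

    void serviceIsOk(ServiceKey key) {
        failures.remove(key);
    }

    // edqsCheck stands in for the real body of checkEdqs().
    void runCycle(Runnable edqsCheck) {
        try {
            checkEdqs(edqsCheck);
            // Fix 1: reached only when nothing escaped to the catch-all,
            // so GENERAL gets a recovery signal.
            serviceIsOk(ServiceKey.GENERAL);
        } catch (ServiceFailureException e) {
            serviceFailure(e.key, e);
        } catch (Exception e) {
            serviceFailure(ServiceKey.GENERAL, e);
        }
    }

    void checkEdqs(Runnable edqsCheck) {
        try {
            edqsCheck.run();
            serviceIsOk(ServiceKey.EDQS);
        } catch (ServiceFailureException e) {
            throw e; // already attributed to a specific service
        } catch (Exception e) {
            // Fix 2: e.g. a connect/read timeout is reported under EDQS.
            serviceFailure(ServiceKey.EDQS, e);
        }
    }
}
```

In this sketch a timeout inside the EDQS check increments only the EDQS counter, and a later clean cycle clears a stale GENERAL failure.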
The resolution countdown now starts only once every failing service has
recovered. While any service is still failing, the timer is cancelled, so
the incident cannot auto-resolve between alerts and spawn a new incident
for the same ongoing failure.
High latency is a warning signal (no explicit recovery event) and
therefore does not block resolution; only FAILING services do.
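The countdown rule can be sketched roughly as follows. ResolutionTimerSketch and its method names are hypothetical, and the choice to leave the timer untouched on a high-latency alert is an assumption; the real IncidentManager may handle that case differently.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of "start the countdown only when nothing is failing".
class ResolutionTimerSketch {
    // Daemon thread so a pending countdown never keeps the JVM alive.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "incident-resolution");
                t.setDaemon(true);
                return t;
            });
    private final Set<String> failing = new HashSet<>();
    private final long resolutionTimeoutMs;
    private ScheduledFuture<?> countdown;
    private boolean resolved;

    ResolutionTimerSketch(long resolutionTimeoutMs) {
        this.resolutionTimeoutMs = resolutionTimeoutMs;
    }

    synchronized void onFailing(String service) {
        failing.add(service);
        cancelCountdown(); // a failing service always blocks resolution
    }

    synchronized void onRecovered(String service) {
        failing.remove(service);
        if (failing.isEmpty()) {
            cancelCountdown();
            countdown = scheduler.schedule(
                    this::resolve, resolutionTimeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    synchronized void onHighLatency(String service) {
        // Warning only: neither blocks nor starts the countdown (assumption).
    }

    synchronized boolean countdownActive() {
        return countdown != null && !countdown.isDone();
    }

    synchronized boolean isResolved() {
        return resolved;
    }

    private synchronized void resolve() {
        resolved = true;
    }

    private void cancelCountdown() {
        if (countdown != null) {
            countdown.cancel(false);
            countdown = null;
        }
    }
}
```

With two failing services, the countdown becomes active only after the second one recovers, and any new failure cancels it again.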
Spring wraps Apache HttpClient connect failures as 'I/O error on POST request:
Connect to <URL> failed: <reason>'. These now get the same treatment as the
RestClient variant: the URL is embedded in the word 'request' and the
'Connect to ... failed' restatement is dropped from the visible text.
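One possible way to express this rewrite is a single regex replacement. ConnectFailureLinkSketch and linkify() are hypothetical names, and the optional bracketed address list after the URL is an assumption about how HttpClient formats the message; the actual pattern in the change may differ.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: fold the "Connect to <URL> failed" restatement
// into a Slack mrkdwn link on the word "request".
class ConnectFailureLinkSketch {
    // Group 1 is the target; an optional " [addr, ...]" suffix is discarded.
    private static final Pattern CONNECT_FAILED = Pattern.compile(
            "request: Connect to (\\S+)(?: \\[[^\\]]*\\])? failed: ");

    static String linkify(String message) {
        return CONNECT_FAILED.matcher(message).replaceAll("<$1|request>: ");
    }

    public static void main(String[] args) {
        System.out.println(linkify(
                "I/O error on POST request: Connect to http://localhost:8080 failed: Connection refused"));
    }
}
```

Messages that do not match the pattern pass through unchanged.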
Recovered services are now shown as ':large_green_circle: <name> (<count>)'
with the last failure count that was observed before the service recovered,
matching the red-circle format.
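A rough sketch of how the last failure count could be retained across recovery. IncidentHeaderSketch and its methods are hypothetical, and the ':red_circle:' shortcode for failing services is an assumption; only the green-circle format is spelled out in this change.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the per-service header lines.
class IncidentHeaderSketch {
    // Last observed failure count per service, in first-seen order.
    private final Map<String, Integer> lastFailureCount = new LinkedHashMap<>();
    private final Set<String> recovered = new HashSet<>();

    void onFailure(String service, int failureCount) {
        lastFailureCount.put(service, failureCount);
        recovered.remove(service);
    }

    void onRecovery(String service) {
        // Keep the count; only the marker changes.
        if (lastFailureCount.containsKey(service)) {
            recovered.add(service);
        }
    }

    String render() {
        return lastFailureCount.entrySet().stream()
                .map(e -> (recovered.contains(e.getKey())
                        ? ":large_green_circle: "
                        : ":red_circle: ")
                        + e.getKey() + " (" + e.getValue() + ")")
                .collect(Collectors.joining("\n"));
    }
}
```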
Transforms 'request for "<URL>"' into Slack mrkdwn <URL|request>, so the
failure line renders the word 'request' as a clickable link instead of
showing the raw URL in quotes.
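The transformation can be illustrated with a single regex replacement. RequestLinkSketch and linkify() are hypothetical names, and the pattern below is an assumption rather than the exact one used in the change.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: 'request for "<URL>"' becomes Slack mrkdwn <URL|request>.
class RequestLinkSketch {
    // Group 1 captures the quoted URL.
    private static final Pattern REQUEST_FOR = Pattern.compile("request for \"([^\"]+)\"");

    static String linkify(String message) {
        return REQUEST_FOR.matcher(message).replaceAll("<$1|request>");
    }

    public static void main(String[] args) {
        System.out.println(linkify(
                "I/O error on GET request for \"http://localhost:8080/api/x\": Read timed out"));
    }
}
```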
Adds Slack-API-based incident grouping for the monitoring microservice:
alerts that fire within resolution_timeout_s are threaded under a single
"Incident" message whose header tracks affected services in real time, and
the incident auto-resolves after a quiet period with a final summary.
Highlights
- Dual Slack modes: when bot_token + channel_id are set, alerts go through
chat.postMessage / chat.update with threaded replies; otherwise the
existing webhook path is used unchanged.
- Incident header shows :red_circle: failing / :large_yellow_circle: high
latency / :large_green_circle: recovered services with live failure
counts and elapsed duration (updated every minute).
- Notifications carry structured affected-service data
(AffectedService { name, status, failureCount }) so the incident layer no
longer parses formatted alert text with regex.
- IncidentManager is decoupled from Slack via a small IncidentTransport
interface; SlackIncidentTransport adapts SlackApiClient.
- PE/other service-key types can plug in a friendly name via the
ShortNameProvider interface; TransportInfo implements it.
- Config lives under monitoring.notifications.incident.* (enabled,
resolution_timeout_s, tag_channel). Slack bot config stays under
monitoring.notifications.slack.{bot_token,channel_id}. YAML defaults are
authoritative; Spring @Value no longer carries a conflicting fallback.
- Concurrency: state changes and transport I/O run under the manager's
monitor. The Slack client has explicit 5s call timeouts (set on
SlackConfig), so the lock hold time is bounded, and it is closed on
@PreDestroy.
- HTTP failure text is sanitised: HTML response bodies are stripped so
Nginx-style error pages don't flood alerts.
- BaseMonitoringService splits login / WS connect / WS subscribe into
distinct MonitoredServiceKey entries, uses catch(Exception) instead of
catch(Throwable), and wraps WsClient in try-with-resources.
- Unit tests cover incident lifecycle, status transitions, duration
formatting, HTML body stripping, and the ShortNameProvider dispatch.
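As an illustration of the HTML sanitising highlight, the sketch below cuts a failure message at the first HTML document marker. HtmlBodyStripSketch, stripHtmlBody() and the '[HTML body omitted]' placeholder are all hypothetical; the actual stripping rules in the change are not shown here and may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: drop an embedded HTML error page from failure text.
class HtmlBodyStripSketch {
    // Start of an HTML document: a doctype declaration or an <html> tag.
    private static final Pattern HTML_START = Pattern.compile(
            "<!doctype\\s+html|<html[\\s>]", Pattern.CASE_INSENSITIVE);

    static String stripHtmlBody(String message) {
        if (message == null) {
            return null;
        }
        Matcher m = HTML_START.matcher(message);
        if (!m.find()) {
            return message; // plain-text failures pass through untouched
        }
        String prefix = message.substring(0, m.start()).trim();
        return prefix.isEmpty()
                ? "[HTML body omitted]"
                : prefix + " [HTML body omitted]";
    }

    public static void main(String[] args) {
        System.out.println(stripHtmlBody(
                "502 Bad Gateway: <html>\r\n<head><title>502 Bad Gateway</title></head>"
                        + "<body><center>nginx</center></body></html>"));
    }
}
```

Keeping the status text that precedes the page preserves the useful part of the error while removing the markup.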
Remove <pkg.skip.bootjar>false</pkg.skip.bootjar> from all child
module <properties> blocks. The root POM already defaults it to false,
and child declarations block the skip-pkg profile override, so
-Dpkg.skip=true was never actually skipping spring-boot:repackage.
Also remove the unused surefire.version property (superseded by
maven-surefire-plugin.version).
Introduces four independent flags to skip individual packaging artifacts:
  -Dpkg.skip.bootjar=true   skip spring-boot repackage (*-boot.jar)
  -Dpkg.skip.deb=true       skip Gradle buildDeb + Maven attach-artifact
  -Dpkg.skip.rpm=true       skip Gradle buildRpm
  -Dpkg.skip.zip=true       skip maven-assembly-plugin Windows ZIP
Adds -Dpkg.skip=true as a single convenience flag that sets all four
at once. msa/pom.xml mirrors the skip-pkg profile to override its own
<pkg.deb.phase>package</pkg.deb.phase> property (child POM properties
have higher priority than parent profile properties in Maven).
msa/* docker modules used ${basedir}/../.. (non-canonical) for main.dir.
maven-enforcer-plugin 3.5.0's osIndependentNameMatch() compares
file.toURI() with file.getCanonicalFile().toURI(). These differ when the
path contains '..', causing RequireFilesExist to report a false negative.
Fix: replace ${basedir}/../.. with ${maven.multiModuleProjectDirectory}.
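The mismatch is easy to reproduce with plain java.io.File. The sketch below is a simplified stand-in for the comparison described above, not the enforcer plugin's actual code; CanonicalUriSketch and urisMatch() are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;

// Hypothetical sketch: toURI() keeps ".." segments, the canonical form resolves them.
class CanonicalUriSketch {
    static boolean urisMatch(File file) {
        try {
            URI plain = file.toURI();
            URI canonical = file.getCanonicalFile().toURI();
            return plain.equals(canonical);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File nonCanonical = new File(System.getProperty("java.io.tmpdir"), "a/b/../c");
        System.out.println(nonCanonical.toURI());
        System.out.println(urisMatch(nonCanonical));
    }
}
```

A path built from ${basedir}/../.. has the same shape as the non-canonical File here, which is why an already-canonical directory property avoids the mismatch.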