The resolution countdown now only starts once every failing service has
recovered. While any service is still firing, the timer is cancelled, so
the incident cannot auto-resolve between alerts and spawn a new incident
for the same ongoing failure.
High latency is a warning signal with no explicit recovery event, so it
does not block resolution; only FAILING services do.
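
The gating rule above can be sketched roughly as follows. This is a minimal
illustration, not the actual IncidentManager code: ResolutionGate, its Status
enum and onStatus are hypothetical names, and the real timer (presumably a
scheduled task that is cancelled and rescheduled) is reduced to a boolean.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: all names here are illustrative assumptions.
final class ResolutionGate {
    enum Status { FAILING, HIGH_LATENCY, RECOVERED }

    private final Set<String> failing = new HashSet<>();
    private boolean countdownRunning;

    /** Applies one status update; returns true if the countdown runs afterwards. */
    synchronized boolean onStatus(String service, Status status) {
        switch (status) {
            case FAILING:
                failing.add(service);
                break;
            case RECOVERED:
                failing.remove(service);
                break;
            case HIGH_LATENCY:
                // Warning only: no recovery event exists for it, so it is
                // never added to the set that blocks resolution.
                break;
        }
        // The countdown may run only while no service is still failing.
        // A real implementation would cancel or (re)schedule a timer here.
        countdownRunning = failing.isEmpty();
        return countdownRunning;
    }

    synchronized boolean isCountdownRunning() {
        return countdownRunning;
    }
}
```

With two failing services, the countdown in this sketch stays off until the
second one reports RECOVERED; a HIGH_LATENCY update in between changes nothing.
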
Spring wraps Apache HttpClient connect failures as 'I/O error on POST request:
Connect to <URL> failed: <reason>'. Same treatment as the RestClient variant:
the URL becomes the link target of the word 'request' and the
'Connect to ... failed' restatement is dropped from the visible text.
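
A rough sketch of that rewrite, assuming the message has exactly the shape
quoted above. ConnectFailureFormatter and its regex are hypothetical; real
HttpClient messages may carry extra detail (for example resolved addresses)
that this pattern does not handle, in which case the text is left unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is an assumption based on the message
// shape "... request: Connect to <URL> failed: <reason>".
final class ConnectFailureFormatter {
    private static final Pattern CONNECT_FAILED =
            Pattern.compile("\\brequest: Connect to (\\S+) failed: ");

    /** Turns the word "request" into a Slack mrkdwn link and drops the restatement. */
    static String linkify(String message) {
        Matcher m = CONNECT_FAILED.matcher(message);
        if (!m.find()) {
            return message; // unknown shape: leave the text untouched
        }
        String link = "<" + m.group(1) + "|request>: ";
        return message.substring(0, m.start()) + link + message.substring(m.end());
    }
}
```

Under these assumptions, a connect failure against
https://api.example.com/login with reason "Connection refused" would render
as "I/O error on POST request: Connection refused", with "request" as the link.
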
Recovered services are now shown as ':large_green_circle: <name> (<count>)'
with the last failure count that was observed before the service recovered,
matching the red-circle format.
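
For illustration, the two entry formats could be produced by something like
the following. ServiceEntryFormatter is a hypothetical name, and the
':red_circle:' shortcode for failing services is an assumption; only the
green-circle format is spelled out above.

```java
// Hypothetical formatter; only the "<circle> <name> (<count>)" shape comes
// from the description, the rest is an assumption.
final class ServiceEntryFormatter {
    /**
     * @param recovered        true once the service has recovered
     * @param lastFailureCount count observed while failing; kept after recovery
     */
    static String format(String name, boolean recovered, int lastFailureCount) {
        String circle = recovered ? ":large_green_circle:" : ":red_circle:";
        return circle + " " + name + " (" + lastFailureCount + ")";
    }
}
```
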
Transforms 'request for "<URL>"' into Slack mrkdwn <URL|request>, so the
failure line renders the word 'request' as a clickable link instead of
showing the raw URL in quotes.
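
One way to express this transform is a single regex replacement. The sketch
below is an assumption about the approach, not the code in this change;
RequestLinkFormatter is a hypothetical name, and it requires Java 9+ for
Matcher.replaceAll with a function argument.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the 'request for "<URL>"' rewrite.
final class RequestLinkFormatter {
    private static final Pattern REQUEST_FOR =
            Pattern.compile("\\brequest for \"([^\"]+)\"");

    /** Replaces 'request for "<URL>"' with the Slack mrkdwn link <URL|request>. */
    static String linkify(String message) {
        Matcher m = REQUEST_FOR.matcher(message);
        // quoteReplacement keeps '$' and '\' in the URL from being read as
        // group references in the replacement string.
        return m.replaceAll(r ->
                Matcher.quoteReplacement("<" + r.group(1) + "|request>"));
    }
}
```

Text that does not contain the quoted-URL form passes through unchanged.
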
Adds Slack-API-based incident grouping for the monitoring microservice:
alerts that fire within resolution_timeout_s are threaded under a single
"Incident" message whose header tracks affected services in real time, and
the incident auto-resolves after a quiet period with a final summary.
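
The threading behaviour can be pictured with a deliberately simplified
sketch. Everything in it is hypothetical: the two-method IncidentTransport
shape, IncidentGrouper, and RecordingTransport are illustrative stand-ins,
and header updates, the failing-service gate and the final summary are
left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical transport seam; the real interface may look different.
interface IncidentTransport {
    String postHeader(String text);               // returns the header's Slack "ts"
    void postReply(String threadTs, String text); // threaded reply under the header
}

// Simplified grouping: alerts within the timeout share one thread.
final class IncidentGrouper {
    private final IncidentTransport transport;
    private final long resolutionTimeoutMs;
    private String openThreadTs; // null while no incident is open
    private long lastAlertAtMs;

    IncidentGrouper(IncidentTransport transport, long resolutionTimeoutS) {
        this.transport = transport;
        this.resolutionTimeoutMs = resolutionTimeoutS * 1000L;
    }

    synchronized void onAlert(String alertText, long nowMs) {
        boolean expired = openThreadTs != null
                && nowMs - lastAlertAtMs > resolutionTimeoutMs;
        if (openThreadTs == null || expired) {
            openThreadTs = transport.postHeader("Incident");
        }
        transport.postReply(openThreadTs, alertText);
        lastAlertAtMs = nowMs;
    }
}

// In-memory fake, useful for unit tests that should not talk to Slack.
final class RecordingTransport implements IncidentTransport {
    final List<String> calls = new ArrayList<>();
    private int next = 1;

    @Override
    public String postHeader(String text) {
        String ts = "ts-" + next++;
        calls.add("header " + ts);
        return ts;
    }

    @Override
    public void postReply(String threadTs, String text) {
        calls.add("reply " + threadTs + " " + text);
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic
under test; a production version would more likely rely on a scheduler.
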
Highlights
- Dual Slack modes: when bot_token + channel_id are set, alerts go through
chat.postMessage / chat.update with threaded replies; otherwise the
existing webhook path is used unchanged.
- Incident header shows 🔴 failing / :large_yellow_circle: high
latency / :large_green_circle: recovered services with live failure
counts and elapsed duration (updated every minute).
- Notifications carry structured affected-service data
(AffectedService { name, status, failureCount }) so the incident layer no
longer parses formatted alert text with regex.
- IncidentManager is decoupled from Slack via a small IncidentTransport
interface; SlackIncidentTransport adapts SlackApiClient.
- PE/other service-key types can plug in a friendly name via the
ShortNameProvider interface; TransportInfo implements it.
- Config lives under monitoring.notifications.incident.* (enabled,
resolution_timeout_s, tag_channel). Slack bot config stays under
monitoring.notifications.slack.{bot_token,channel_id}. YAML defaults are
authoritative; Spring @Value no longer carries a conflicting fallback.
- Concurrency: state changes and transport I/O run under the manager's
monitor. The Slack client has explicit 5s call timeouts (set on SlackConfig)
so the lock hold time is bounded. The Slack client is closed on PreDestroy.
- HTTP failure text is sanitised: HTML response bodies are stripped so
Nginx-style error pages don't flood alerts.
- BaseMonitoringService splits login / WS connect / WS subscribe into
distinct MonitoredServiceKey entries, uses catch(Exception) instead of
catch(Throwable), and wraps WsClient in try-with-resources.
- Unit tests cover incident lifecycle, status transitions, duration
formatting, HTML body stripping, and the ShortNameProvider dispatch.
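
As an illustration of the HTML sanitising bullet above, a stripper could cut
the failure text at the first HTML document marker. This is a sketch under
assumptions: HtmlBodyStripper, the marker regex and the
"[HTML body omitted]" placeholder are hypothetical and may not match the
actual behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sanitiser; marker detection and placeholder are assumptions.
final class HtmlBodyStripper {
    private static final Pattern HTML_START =
            Pattern.compile("<(?:!doctype\\s+html|html)\\b", Pattern.CASE_INSENSITIVE);

    /** Drops everything from the first HTML document marker onwards. */
    static String strip(String failureText) {
        Matcher m = HTML_START.matcher(failureText);
        if (!m.find()) {
            return failureText; // no HTML body detected
        }
        // Trim the separator characters left dangling before the body.
        String head = failureText.substring(0, m.start())
                .replaceAll("[\\s:\"]+$", "");
        return head + " [HTML body omitted]";
    }
}
```

A status line followed by an Nginx error page would then be reduced to the
status line plus the placeholder, while plain-text failures are unchanged.
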