emit recovery signal for MonitoredServiceKey.GENERAL and route EDQS I/O errors correctly

Two related fixes for a stuck-open 'Monitoring' entry in the incident header:

1. serviceIsOk(GENERAL) is now called when a monitoring cycle completes
   successfully. Previously GENERAL could only accumulate failures (via the
   outer Throwable catch), with no complementary recovery, so once the
   catch-all fired the service stayed red forever.
2. checkEdqs() is now wrapped in its own try/catch that reports any
   non-ServiceFailureException failures under EDQS rather than GENERAL.
   Connection/read timeouts hitting /api/entitiesQuery/find previously
   propagated unwrapped and were bucketed as GENERAL, which hid the fact
   that EDQS was the failing component.
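
Illustrative sketch of the two behaviours described above. This is not the
ThingsBoard code: ServiceKey, Reporter, ServiceFailureException and runCycle
below are hypothetical, simplified stand-ins, and the real reporter may track
state differently. It only assumes that a failure marks a key red until a
matching serviceIsOk call clears it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.EnumSet;
import java.util.Set;

class MonitoringCycleSketch {

    // Hypothetical stand-in for the monitored service keys.
    enum ServiceKey { GENERAL, EDQS }

    // Hypothetical failure type that already knows which service it belongs to.
    static class ServiceFailureException extends RuntimeException {
        final ServiceKey key;

        ServiceFailureException(ServiceKey key, String message) {
            super(message);
            this.key = key;
        }
    }

    // Hypothetical reporter: a key stays "failing" until serviceIsOk clears it.
    static class Reporter {
        private final Set<ServiceKey> failing = EnumSet.noneOf(ServiceKey.class);

        void serviceFailure(ServiceKey key, Throwable error) { failing.add(key); }

        void serviceIsOk(ServiceKey key) { failing.remove(key); }

        Set<ServiceKey> failing() { return failing; }
    }

    static void runCycle(Reporter reporter, Runnable otherChecks, Runnable edqsCheck) {
        try {
            otherChecks.run();
            try {
                edqsCheck.run();
                reporter.serviceIsOk(ServiceKey.EDQS);
            } catch (ServiceFailureException e) {
                reporter.serviceFailure(e.key, e);
                return;
            } catch (Exception e) {
                // Fix 2: raw I/O errors from the EDQS check are attributed to EDQS.
                reporter.serviceFailure(ServiceKey.EDQS, e);
                return;
            }
            // Fix 1: a fully successful cycle clears any earlier GENERAL failure.
            reporter.serviceIsOk(ServiceKey.GENERAL);
        } catch (ServiceFailureException e) {
            reporter.serviceFailure(e.key, e);
        } catch (Throwable t) {
            // Catch-all: anything unattributed lands under GENERAL.
            reporter.serviceFailure(ServiceKey.GENERAL, t);
        }
    }

    public static void main(String[] args) {
        Reporter reporter = new Reporter();

        // A read timeout inside the EDQS check is reported under EDQS only.
        runCycle(reporter, () -> {}, () -> {
            throw new UncheckedIOException(new IOException("read timed out"));
        });
        System.out.println(reporter.failing()); // [EDQS]

        // An unexpected error elsewhere trips the GENERAL catch-all.
        runCycle(reporter, () -> {
            throw new IllegalStateException("unexpected");
        }, () -> {});
        System.out.println(reporter.failing()); // [GENERAL, EDQS]

        // A clean cycle clears both entries.
        runCycle(reporter, () -> {}, () -> {});
        System.out.println(reporter.failing()); // []
    }
}
```

Without the serviceIsOk(GENERAL) call in this sketch, the last cycle would
leave GENERAL in the failing set, which mirrors the stuck-open entry.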
pull/15456/head
Oleksii Kuripko 2 months ago
parent
commit
df4dc25082
  1. 17
      monitoring/src/main/java/org/thingsboard/monitoring/service/BaseMonitoringService.java

@@ -155,13 +155,22 @@ public abstract class BaseMonitoringService<C extends MonitoringConfig<T>, T ext
             }
             if (checkEdqs) {
-                stopWatch.start();
-                checkEdqs();
-                reporter.reportLatency(Latencies.EDQS_QUERY, stopWatch.getTime());
-                reporter.serviceIsOk(MonitoredServiceKey.EDQS);
+                try {
+                    stopWatch.start();
+                    checkEdqs();
+                    reporter.reportLatency(Latencies.EDQS_QUERY, stopWatch.getTime());
+                    reporter.serviceIsOk(MonitoredServiceKey.EDQS);
+                } catch (ServiceFailureException e) {
+                    reporter.serviceFailure(e.getServiceKey(), e);
+                    return;
+                } catch (Exception e) {
+                    reporter.serviceFailure(MonitoredServiceKey.EDQS, e);
+                    return;
+                }
             }
             reporter.reportLatencies();
+            reporter.serviceIsOk(MonitoredServiceKey.GENERAL);
             log.debug("Finished {}", getName());
         } catch (ServiceFailureException e) {
             reporter.serviceFailure(e.getServiceKey(), e);
