Creates missing system images from application/src/main/data/resources/images
during LTS patch startup, mirroring the upgrade-path loadSystemResources logic.
Existing system images in the DB are left untouched.
Two related fixes for a stuck-open 'Monitoring' entry in the incident header:
1. serviceIsOk(GENERAL) is now called when a monitoring cycle completes
successfully. Previously GENERAL could only accumulate failures (via the
outer Throwable catch), with no complementary recovery, so once the
catch-all fired the service stayed red forever.
2. checkEdqs() is now wrapped in its own try/catch that reports any
non-ServiceFailureException failures under EDQS rather than GENERAL.
Connection/read timeouts hitting /api/entitiesQuery/find previously
propagated unwrapped and were bucketed as GENERAL, which hid the fact
that EDQS was the failing component.
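
A minimal sketch of the two fixes, assuming a much simplified monitoring
loop. Only serviceIsOk, checkEdqs, GENERAL, EDQS and ServiceFailureException
are names from the description above; MonitoringCycle, serviceFailure,
runCycle, the failure map and the Runnable probe are hypothetical stand-ins,
and the real code may route ServiceFailureException differently.

```java
import java.util.EnumMap;
import java.util.Map;

enum ServiceKey { GENERAL, EDQS }

// Hypothetical shape: a failure that already knows which service it belongs to.
class ServiceFailureException extends RuntimeException {
    final ServiceKey key;

    ServiceFailureException(ServiceKey key, String message) {
        super(message);
        this.key = key;
    }
}

class MonitoringCycle {
    // Failure count per service; an absent key means the service is healthy.
    final Map<ServiceKey, Integer> failures = new EnumMap<>(ServiceKey.class);
    private final Runnable edqsProbe; // stand-in for the /api/entitiesQuery/find call

    MonitoringCycle(Runnable edqsProbe) {
        this.edqsProbe = edqsProbe;
    }

    void serviceIsOk(ServiceKey key) {
        failures.remove(key);
    }

    void serviceFailure(ServiceKey key) {
        failures.merge(key, 1, Integer::sum);
    }

    // Fix 2: anything that is not already a ServiceFailureException counts against EDQS.
    void checkEdqs() {
        try {
            edqsProbe.run();
            serviceIsOk(ServiceKey.EDQS);
        } catch (ServiceFailureException e) {
            throw e; // already attributed to a service; the caller records it
        } catch (Exception e) {
            serviceFailure(ServiceKey.EDQS);
        }
    }

    void runCycle() {
        try {
            checkEdqs();
            // Fix 1: a cycle that completes clears GENERAL again.
            serviceIsOk(ServiceKey.GENERAL);
        } catch (ServiceFailureException e) {
            serviceFailure(e.key);
        } catch (Exception e) {
            serviceFailure(ServiceKey.GENERAL); // catch-all bucket
        }
    }
}
```

With a probe that throws a plain timeout-style exception, one cycle leaves
EDQS with a failure count of 1 and clears any earlier GENERAL failure.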
The resolution countdown now only starts once every failing service has
recovered. While any service is still firing, the timer is cancelled, so
the incident cannot auto-resolve and then spawn a new incident for the
same ongoing failure between alerts.
High latency is a warning signal (no explicit recovery event) and
therefore does not block resolution — only FAILING services do.
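
The gating rule can be sketched as follows, assuming the per-service status
is kept in a map. ResolutionGate, Status and the boolean flag are
hypothetical; the real code would schedule or cancel an actual timer where
the flag is set.

```java
import java.util.HashMap;
import java.util.Map;

enum Status { FAILING, HIGH_LATENCY, RECOVERED }

class ResolutionGate {
    private final Map<String, Status> services = new HashMap<>();
    private boolean countdownRunning;

    // Records the latest status and decides whether the countdown may run.
    void update(String service, Status status) {
        services.put(service, status);
        // Only FAILING blocks resolution; HIGH_LATENCY is a warning and does not.
        countdownRunning = !services.containsValue(Status.FAILING);
    }

    boolean isCountdownRunning() {
        return countdownRunning;
    }
}
```

If two services are failing and only one recovers, the countdown stays
cancelled; it starts once the second one recovers, even if another service
is still reporting high latency.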
Root pom.xml wired <skipAssembly>${pkg.skip.zip}</skipAssembly> at the
plugin-level <configuration> of maven-assembly-plugin inside the
always-active `packaging` profile's <pluginManagement>. Maven merges
plugin-level <configuration> into every execution of that plugin, so
-Dpkg.skip.zip=true (and the -Dpkg.skip=true alias that activates it)
suppressed any maven-assembly-plugin execution across the reactor -
not only the intended Windows ZIP execution.
In CE lts-4.2/4.3 this is latent (no CE module declares a non-ZIP
assembly execution), but it breaks downstream forks that do. PE's
rule-node-twilio-sms, for instance, declares a custom make-assembly
execution producing the classified -rule-node.jar consumed by
application's copy-pe-rule-nodes step; under -Dpkg.skip.zip=true that
assembly silently became a no-op and the downstream build failed to
resolve the classified artifact.
tools/pom.xml already sidesteps this via `combine.self="override"` on
its own <pluginManagement> - earlier evidence that the placement was
fragile.
Move <skipAssembly> into the `assembly` execution's own <configuration>
so it scopes only to the Windows ZIP execution.
Verified via mvn help:effective-pom on application/: with the fix,
<skipAssembly>true</skipAssembly> no longer appears at plugin-level
<configuration>, only inside the `assembly` <execution>.
Several testSaveProtoDeviceProfileWithInvalidRpcRequestSchema* tests
intermittently fail with:
org.thingsboard.server.dao.exception.TenantNotFoundException: Tenant
with id <fresh-tenant-uuid> not found
when the tenant created in @Before has not yet been populated in the
tenant profile cache by the time the request hits the partition-lookup
path (DefaultTenantRoutingInfoService -> TbTenantProfileCache ->
TenantService#findTenantById). The underlying request is idempotent
(the schema is invalid, so it is rejected with 400 regardless of
retries), so wrap the doPost + status assertion in Awaitility and call
Mockito.reset inside the retry block, so that only the last attempt's
invocations are visible to the subsequent verify* assertions.
Applies to all testSaveDeviceProfileWithInvalidRpcRequestProtoSchema
callers, including the currently-muted
testSaveProtoDeviceProfileWithInvalidRpcRequestSchemaRequestIdDateType.
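
The retry pattern, sketched with plain JDK code rather than the Awaitility
and Mockito calls the test actually uses. RetryUntilStatus, the IntSupplier
request and the recorded list are hypothetical stand-ins: clearing the list
plays the role of Mockito.reset, and the loop plays the role of Awaitility.

```java
import java.util.List;
import java.util.function.IntSupplier;

class RetryUntilStatus {
    // Repeats the request until it returns expectedStatus; returns the attempts used.
    static int await(IntSupplier request, int expectedStatus, List<String> recorded,
                     int maxAttempts, long pollMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            recorded.clear(); // stand-in for Mockito.reset(...): forget earlier attempts
            if (request.getAsInt() == expectedStatus) {
                return attempt;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while retrying", e);
            }
        }
        throw new AssertionError(
                "status " + expectedStatus + " not seen after " + maxAttempts + " attempts");
    }
}
```

Because the reset happens at the start of every attempt, whatever the
successful attempt recorded is all that later verification sees.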
The test asserts exactly 2 UserCredentialsUpdateMsg after creating a new
tenant-admin user, but the user activation flow can emit either 2 or 3
depending on timing:
- activateUserCredentials publishes CREDENTIALS_UPDATED (msg #1)
- setUserCredentialsEnabled publishes CREDENTIALS_UPDATED (msg #2)
- the initial USER ADDED edge event is processed asynchronously in
UserEdgeProcessor and bundles an extra UserCredentialsUpdateMsg when
it finds userCredentials.isEnabled() == true (i.e. activation
already raced past the ADDED event)
When activation wins that race we end up with 1 UserUpdateMsg plus
3 UserCredentialsUpdateMsg, which currently fails the hard-coded
assertEquals(2, ...) assertion.
Accept both 2 and 3 UserCredentialsUpdateMsg instead of asserting an
exact count, matching the reality of the asynchronous edge event
pipeline.
Wait for cached resource data to become available after the eviction
triggered by save before asserting, and wait for null after deletion.
This prevents a Mockito verifyNoMoreInteractions(resourceService)
failure caused by racing background cache-load invocations.
Backport of 99334ba7fe from master.
Spring wraps Apache HttpClient connect failures as 'I/O error on POST request:
Connect to <URL> failed: <reason>'. Same treatment as the RestClient variant:
the URL is embedded in the 'request' word and the 'Connect to ... failed'
restatement is dropped from the visible text.
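
A rough sketch of this rewrite, assuming the message has exactly the shape
quoted above and that the URL contains no whitespace. ConnectFailureFormatter
and its regex are hypothetical; real HttpClient messages may carry extra
detail that this pattern does not handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConnectFailureFormatter {
    // Captures the URL from "request: Connect to <URL> failed: ".
    private static final Pattern CONNECT_FAILED =
            Pattern.compile("request: Connect to (\\S+) failed: ");

    static String linkify(String message) {
        // Fold the URL into the word "request" and drop the restatement.
        return CONNECT_FAILED.matcher(message)
                .replaceAll(r -> Matcher.quoteReplacement("<" + r.group(1) + "|request>: "));
    }
}
```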
Recovered services are now shown as ':large_green_circle: <name> (<count>)'
with the last failure count that was observed before the service recovered,
matching the red-circle format.
Transforms 'request for "<URL>"' into Slack mrkdwn <URL|request>, so the
failure line renders the word 'request' as a clickable link instead of
showing the raw URL in quotes.
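
The transformation can be sketched with a single regex replacement.
SlackLinkFormatter and the pattern are hypothetical; the sketch assumes the
URL itself contains no double quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlackLinkFormatter {
    // Matches: request for "<URL>"
    private static final Pattern REQUEST_FOR_URL =
            Pattern.compile("request for \"([^\"]+)\"");

    static String linkify(String message) {
        // Slack mrkdwn link syntax is <URL|label>; the label here is the word "request".
        return REQUEST_FOR_URL.matcher(message)
                .replaceAll(r -> Matcher.quoteReplacement("<" + r.group(1) + "|request>"));
    }
}
```

Messages without the quoted-URL form pass through unchanged.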
Build hygiene for developers who rebuild lts-4.2 frequently: cleans the
pom.xml sources that generate noise without any code change.
- Pin maven-clean-plugin to 3.5.0 (latest stable) via a
<maven-clean-plugin.version> property, matching the convention already
used for surefire/install/deploy/jar plugins. Removes 55 "version is
missing" warnings plus the cascading "Some problems were encountered
while building the effective model" messages for every child module.
- Extend license-maven-plugin excludes for files that never carry a
license header: **/lombok.config, **/eslint.config.mjs,
**/config.monitoring, **/valkey-certs/**, **/data/certs/**, **/*.otf.
Directory-scoped patterns are used instead of broad extension globs
(**/*.crt, **/*.key, **/*.pem) so a stray cert dropped outside these
directories still raises a warning.
- Exclude sjk-jfr5 / sjk-jfr6 / sjk-nps transitive deps from cassandra-all
in tools/pom.xml. Their published POMs declare system-scope deps against
unresolved ${jmc5.path}, ${jmc6.path}, ${visualvm.path} properties,
producing 7 ERROR-level lines on every build. No ThingsBoard code imports
sjk, jmc, or netbeans profiler classes.
Net impact: 1040 -> 843 WARNING lines, 7 -> 0 ERROR lines. Build still
green. Full categorization of remaining warnings and Tier 2/3 migration
plan is tracked in issue #15481.
- postgres: 16.6 -> 18 (dao sql-test.properties / nosql-test.properties)
- timescaledb: latest-pg12 -> latest-pg18 (dao timescale-test.properties)
TimescaleDB pg15+ images crash on cgroup v2 CI hosts because
/docker-entrypoint-initdb.d/001_timescaledb_tune.sh evaluates
[ ${TS_TUNE_MEMORY} -gt ${FREE_BYTES} ] with an empty left operand
after the kernel reports the 64-bit max for /sys/fs/cgroup/memory.max.
Work around the upstream bug by setting NO_TS_TUNE=true.
The Testcontainers JDBC URL (jdbc:tc:timescaledb:...) does not support
docker env vars, so register a custom JdbcDatabaseContainerProvider
(TbTimescaleDBContainerProvider, activated via jdbc:tc:tbtimescaledb:...)
that starts a PostgreSQLContainer backed by timescale/timescaledb with
NO_TS_TUNE=true.
Production docker-compose files and tb-postgres image are untouched.
Changelog review (v1.40..v1.48): no breaking changes to the slack-api-client
APIs used in this codebase. The only breaking changes in v1.45.0 were in
slack-app-backend (servlet classes) and slack-api-bolt-aws-lambda-s3-storage
(AWS SDK v1→v2), neither of which we depend on. Transitive deps (okhttp,
gson, okio) are unchanged.
Verified by compiling monitoring + application against 1.48.0 and running
the monitoring unit tests plus the application Notification/Slack test
suites (15 + 54 tests, all pass).
Adds Slack-API-based incident grouping for the monitoring microservice:
alerts that fire within resolution_timeout_s are threaded under a single
"Incident" message whose header tracks affected services in real time, and
the incident auto-resolves after a quiet period with a final summary.
Highlights
- Dual Slack modes: when bot_token + channel_id are set, alerts go through
chat.postMessage / chat.update with threaded replies; otherwise the
existing webhook path is used unchanged.
- Incident header shows :red_circle: failing / :large_yellow_circle: high
latency / :large_green_circle: recovered services with live failure
counts and elapsed duration (updated every minute).
- Notifications carry structured affected-service data
(AffectedService { name, status, failureCount }) so the incident layer no
longer parses formatted alert text with regex.
- IncidentManager is decoupled from Slack via a small IncidentTransport
interface; SlackIncidentTransport adapts SlackApiClient.
- PE/other service-key types can plug in a friendly name via the
ShortNameProvider interface; TransportInfo implements it.
- Config lives under monitoring.notifications.incident.* (enabled,
resolution_timeout_s, tag_channel). Slack bot config stays under
monitoring.notifications.slack.{bot_token,channel_id}. YAML defaults are
authoritative; Spring @Value no longer carries a conflicting fallback.
- Concurrency: state and transport I/O run under the manager's monitor.
Slack client has explicit 5s call timeouts (set on SlackConfig) so the
hold time is bounded. Slack client is closed on PreDestroy.
- HTTP failure text is sanitised: HTML response bodies are stripped so
Nginx-style error pages don't flood alerts.
- BaseMonitoringService splits login / WS connect / WS subscribe into
distinct MonitoredServiceKey entries, uses catch(Exception) instead of
catch(Throwable), and wraps WsClient in try-with-resources.
- Unit tests cover incident lifecycle, status transitions, duration
formatting, HTML body stripping, and the ShortNameProvider dispatch.
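
A minimal sketch of how structured affected-service data could drive the
header lines without regex parsing. The field names (name, status,
failureCount) and the '<circle> <name> (<count>)' line shape come from these
commit messages; ServiceStatus, IncidentHeader, the :red_circle: shortcode
and the one-line-per-service layout are assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

enum ServiceStatus {
    FAILING(":red_circle:"),
    HIGH_LATENCY(":large_yellow_circle:"),
    RECOVERED(":large_green_circle:");

    final String emoji;

    ServiceStatus(String emoji) {
        this.emoji = emoji;
    }
}

// Structured data carried by a notification instead of formatted alert text.
record AffectedService(String name, ServiceStatus status, int failureCount) {}

class IncidentHeader {
    // Renders one "<emoji> <name> (<count>)" line per affected service.
    static String render(List<AffectedService> services) {
        return services.stream()
                .map(s -> s.status().emoji + " " + s.name() + " (" + s.failureCount() + ")")
                .collect(Collectors.joining("\n"));
    }
}
```

Because the status is an enum value rather than text scraped from an alert,
adding a new status only needs a new constant and its emoji.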
TbRestApiCallNodeTest ran concurrently with SsrfSafeAddressResolverGroupTest,
which toggles the static SsrfProtectionValidator.enabled flag in its
setUp/tearDown. When the flag leaked into the REST test's async HTTP calls,
'localhost' was rejected by SSRF protection and extra tellFailure
invocations broke the Mockito verify count.
TbHttpClientTest and SsrfSafeAddressResolverGroupTest already declare
@ResourceLock("SsrfProtectionValidator"); apply the same lock to
TbRestApiCallNodeTest so all three SSRF-sensitive tests serialize.
Fixes #15453
Prevents UnrecognizedPropertyException during rolling upgrades when a
newer node writes a cached entity with an added field and an older node
reads it back. The Redis-backed TbJsonRedisSerializer now uses
JacksonUtil.IGNORE_UNKNOWN_PROPERTIES_JSON_MAPPER instead of the strict
OBJECT_MAPPER used by JacksonUtil.fromBytes.