Synadia Insights

Audit Checks

Insights runs over 100 automated audit checks on every epoch to evaluate the health and operational state of your entire NATS deployment. Each check produces findings with a severity level and remediation guidance, so you get something actionable instead of raw metrics.

For the full catalogue of checks (codes, severities, thresholds, and remediation), see the Audit Checks Reference.

Checks come in two flavors:

  • Operational checks evaluate the current state at each epoch. They drive the entity health indicator in the UI.
  • Optimization checks analyze trends across a user-selected time range. They surface waste and imbalance and run on demand. They do not affect the health indicator.

How Checks Work

After each epoch is indexed, Insights runs every operational check against the current state of the system. A check looks at specific entities (servers, clusters, streams, consumers, accounts, connections) and produces a finding when it spots something worth your attention.

Each finding includes:

  • Severity. How urgent the condition is.
  • Affected entity. The specific server, stream, consumer, or other entity involved.
  • Description. What was detected.
  • Remediation. How to investigate and fix it.
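
As a rough illustration of the shape described above, a finding can be modeled as a small record, and a check as a function that returns one only when its condition holds. This is a sketch, not the actual Insights schema: the `Finding` field names, the `check_stream_has_limits` function, its severity, and the convention that a non-positive value means "unlimited" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical record mirroring the four parts of a finding."""
    severity: str      # "critical", "warning", or "info"
    entity: str        # affected entity, e.g. "stream/ORDERS"
    description: str   # what was detected
    remediation: str   # how to investigate and fix it


def check_stream_has_limits(name: str, max_msgs: int, max_bytes: int,
                            max_age_seconds: float) -> Optional[Finding]:
    """Illustrative check: flag a stream that has no retention limit at all.

    Assumption: a value <= 0 means "unlimited" for every limit.
    """
    if max_msgs > 0 or max_bytes > 0 or max_age_seconds > 0:
        return None  # at least one limit is set, nothing to report
    return Finding(
        severity="warning",  # assumed severity, not taken from the reference
        entity=f"stream/{name}",
        description="Stream has no message, byte, or age limit.",
        remediation="Set at least one retention limit on the stream.",
    )


finding = check_stream_has_limits("ORDERS", -1, -1, 0)
print(finding)
```

Returning `None` when nothing is wrong keeps the "no finding" case explicit, so a caller can collect only the non-empty results.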

Findings show up on entity detail pages in the web UI, and they're aggregated on the overview page for a system-wide summary.

Severity Levels

Severity   Meaning
Critical   Immediate attention required. The system is experiencing or is about to experience an outage, data loss, or significant degradation.
Warning    Action should be taken soon. The condition may lead to problems if left unaddressed, or it indicates a configuration that does not follow best practices.
Info       Informational finding. No immediate action is required, but the condition is worth noting for optimization or awareness.
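
The three levels form a natural ordering, which makes it straightforward to sort findings or reduce many findings to one value per entity. The sketch below assumes that ordering (Info below Warning below Critical). The `Severity` enum and the `worst_per_entity` helper are hypothetical, and the source does not say how Insights itself rolls findings up.

```python
from enum import IntEnum
from typing import Dict, Iterable, Tuple


class Severity(IntEnum):
    """Assumed ordering: a higher value means a more urgent finding."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def worst_per_entity(
    findings: Iterable[Tuple[str, Severity]],
) -> Dict[str, Severity]:
    """Keep only the most severe level seen for each entity."""
    worst: Dict[str, Severity] = {}
    for entity, severity in findings:
        if entity not in worst or severity > worst[entity]:
            worst[entity] = severity
    return worst


findings = [
    ("server/n1", Severity.INFO),
    ("server/n1", Severity.CRITICAL),
    ("stream/ORDERS", Severity.WARNING),
]
rollup = worst_per_entity(findings)
for entity, severity in rollup.items():
    print(entity, severity.name)
```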

Check Categories

Checks are grouped into six categories, each covering a distinct operational concern. Every check belongs to exactly one category.

Health and Availability

Is the system up and reachable?

These checks watch the basic operational state of the NATS deployment. They catch server restarts, crash loops, offline Raft replicas, quorum loss, route disconnections, gateway failures, and service availability. They also flag idle servers and accounts that might be decommissioned infrastructure.

Performance and Throughput

Is the system fast enough?

These checks look at latency and throughput across the system. They catch high CPU usage; slow consumer disconnections; elevated round-trip times on routes, gateways, and client connections; JetStream API pressure; pending message buildup; and poor placement of stream leaders and consumers relative to their clients.

Error and Failure Patterns

What is failing, and how?

These checks find recurring errors and failure modes. They catch connection churn spikes, consumer churn, leader flapping, slow consumer evictions, high redelivery rates, ack pending saturation, JetStream API errors, and subscription churn. They also flag security concerns like bearer token usage and too many connections per user.

Resource Saturation

Are resources running out?

These checks track resource consumption against limits. They catch JetStream memory and storage pressure; connection counts approaching maximums; memory usage outliers; high HA asset counts that degrade Raft performance; streams approaching their limits (messages, bytes, consumers); account-level limit pressure; and uneven distribution of leaders, replicas, connections, storage, and subscriptions across cluster members.
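
A check of the "approaching a limit" kind can be pictured as a ratio of usage to limit compared against thresholds. The following is a minimal sketch of that idea only. The `limit_pressure` function, the 80% and 95% thresholds, and the rule that a non-positive limit means "unlimited" are assumptions for illustration; the real thresholds are listed in the Audit Checks Reference.

```python
from typing import Optional

WARN_RATIO = 0.80      # assumed threshold, not from the reference
CRITICAL_RATIO = 0.95  # assumed threshold, not from the reference


def limit_pressure(used: int, limit: int) -> Optional[str]:
    """Return an assumed severity when usage nears the limit, else None."""
    if limit <= 0:
        return None  # assumed convention: non-positive limit means unlimited
    ratio = used / limit
    if ratio >= CRITICAL_RATIO:
        return "critical"
    if ratio >= WARN_RATIO:
        return "warning"
    return None


# Example: 850 of 1000 allowed connections in use.
print(limit_pressure(850, 1000))
```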

Configuration Consistency

Is the data correct and coherent?

These checks verify the system is configured correctly and consistently. They catch stream and consumer replica lag, version mismatches across servers and services, naming issues (whitespace in cluster or domain names), gateway configuration mismatches, even-numbered meta cluster sizes (which weaken quorum), orphaned exports, imports without subscription interest, over-replicated inactive streams, wasted JetStream reservations, streams without retention limits, disabled compression, and unlimited JetStream accounts.
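
The even-numbered meta cluster check follows from Raft majority arithmetic: a group of n members needs floor(n/2) + 1 votes to make progress, so it tolerates n minus that many failures. A short worked example (the helper names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a group with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


for n in (3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerated={tolerated_failures(n)}")
```

Going from three members to four raises the required majority from two to three without tolerating any additional failure, which is why odd sizes are preferred.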

Change Detection

What changed recently?

These checks surface recent changes to the deployment. They catch server config reloads, JetStream domain changes, accounts appearing or disappearing, stream config changes (replicas, retention, limits), and server restarts. Change detection is how you tie incidents back to deployment events.
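
Conceptually, change detection amounts to comparing what one epoch recorded with what the previous epoch recorded. The sketch below shows that comparison for a stream configuration represented as a plain dictionary. The `diff_config` helper and the key names are hypothetical and do not reflect how Insights stores or compares epochs.

```python
from typing import Any, Dict, Tuple


def diff_config(prev: Dict[str, Any],
                curr: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each changed key to its (old, new) values.

    A key missing on one side is reported with None for that side.
    """
    changes: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(prev.keys() | curr.keys()):
        old, new = prev.get(key), curr.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots of one stream's config at two consecutive epochs.
previous_epoch = {"replicas": 3, "max_age_seconds": 3600}
current_epoch = {"replicas": 1, "max_age_seconds": 3600, "max_bytes": 1024}

changes = diff_config(previous_epoch, current_epoch)
print(changes)
```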

Next Steps