Synadia Insights
Audit Checks Reference
This document is the canonical reference for all automated audit checks. Each check runs against collected monitoring data to surface operational issues and optimization opportunities.
See Audit Checks for the conceptual overview, severity levels, and the operational vs optimization distinction.
The health dot in the UI reflects only operational findings. Any critical finding turns it red (or darker red for 5+); otherwise the color scales with finding count from green through yellow and orange.
Operational Checks
Server
SERVER_001 — Connection Readiness Failure
critical · Health
Flags servers reporting connection readiness failures via the healthz endpoint.
Remediation. Server not ready for connections. Check listener port conflicts, TLS certificate errors, and permission issues. Restart only after identifying the root cause.
SERVER_002 — Server Version Mismatch
warning · Consistency
Identifies servers running a different software version than the cluster majority.
Remediation. Complete the rolling upgrade so all servers run the same version. If intentional, verify compatibility between versions.
SERVER_003 — High CPU Usage
warning · Performance
Flags servers where per-core CPU usage meets or exceeds the threshold.
Remediation. Identify the workload driving CPU usage using the /pprof/profile debug endpoint. Common causes: high-fanout subjects, subscription matching overhead, JetStream write pressure. Scale out with additional servers, optimize subject hierarchies, or increase server capacity.
SERVER_004 — Slow Consumers
critical · Performance
Flags servers with new slow consumer events since the previous epoch.
Remediation. Identify affected clients and increase their message processing throughput. Slow consumer eviction triggers when the server's outbound write to a client times out due to buffer saturation. Increase max_pending in the server config (default 64 MiB) to buffer more, add consumer instances, reduce message rates, or have clients subscribe to fewer subjects.
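For illustration, a minimal nats.go sketch of the client-side counterpart: an error handler that surfaces slow-consumer events, plus a raised per-subscription pending buffer. The URL and the orders.> subject are assumptions.

```go
package main

import (
	"errors"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Surface slow-consumer events instead of discovering them via eviction.
	nc, err := nats.Connect("nats://localhost:4222",
		nats.ErrorHandler(func(_ *nats.Conn, sub *nats.Subscription, err error) {
			if sub != nil && errors.Is(err, nats.ErrSlowConsumer) {
				pending, _, _ := sub.Pending()
				log.Printf("slow consumer on %q: %d msgs pending", sub.Subject, pending)
			}
		}))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	sub, err := nc.Subscribe("orders.>", func(m *nats.Msg) {
		// Process quickly, or hand off to a worker pool.
	})
	if err != nil {
		log.Fatal(err)
	}
	// Raise the client-side buffer (defaults: 512k messages / 64 MiB).
	if err := sub.SetPendingLimits(1_000_000, 128*1024*1024); err != nil {
		log.Fatal(err)
	}
	select {}
}
```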
SERVER_005 — JetStream Memory Pressure
warning · Saturation
Flags servers where JetStream memory usage is at or above 90% of reserved.
Remediation. Reduce memory-backed stream usage, increase max_mem reservation, or convert large memory-backed streams to file-backed storage.
SERVER_006 — JetStream Domain Whitespace
warning · Consistency
Flags servers whose JetStream domain name contains whitespace characters.
Remediation. Remove whitespace from the JetStream domain in the server configuration and reload.
SERVER_007 — Authentication Not Required
critical · Health
Flags servers that do not require client authentication.
Remediation. Enable authentication in the server configuration. Options: NKey authentication, JWT-based operator mode with signed credentials, user/password, TLS certificate mapping, or auth callout for centralized policy enforcement. Avoid no_auth_user in production.
SERVER_008 — Unexpected Server Restart
critical · Change
Detects servers that restarted without an accompanying version upgrade. Compares start times across consecutive epochs and excludes restarts where the server version changed (planned upgrade).
Remediation. This restart was not preceded by a version upgrade. Check server logs for the shutdown reason. Common causes: OOM kill, hardware failure, process crash, or external signal. Review system logs (journalctl, dmesg) and NATS server logs for the exit reason.
SERVER_010 — High Route RTT
warning · Performance
Flags route connections with round-trip time exceeding the threshold.
Remediation. Investigate network latency between cluster peers. Routes are intra-cluster. Check for network congestion, firewall inspection overhead, or co-locate servers in the same region.
SERVER_011 — Connection Count High
warning · Saturation
Flags servers where active connections approach the configured maximum.
Remediation. Increase max_connections in the server config or per-account limits. Distribute clients across more servers using DNS round-robin or a load balancer. Identify and close unnecessary connections.
SERVER_012 — Stale Connections
warning · Errors
Flags servers with new stale connection events since the previous epoch.
Remediation. PING keepalive timeout detected. The server sent PING but the client did not respond with PONG within the configured interval (default: 2 minutes with 3 pings). Common causes: network partition or firewall dropping idle connections, client process crashed without closing the TCP connection, or client event loop blocked and unable to process PINGs. Check firewall idle-timeout settings and client health.
SERVER_013 — Stalled Clients
warning · Performance
Flags servers with new stalled client events since the previous epoch.
Remediation. Stalled clients indicate fast producers blocked by downstream backpressure. The server is throttling the producer because a consumer cannot keep up. Improve consumer processing speed, add consumer instances, or increase max_pending to buffer more. This is normal flow control and an early indicator before slow consumer eviction.
SERVER_014 — JetStream Subsystem Unhealthy
critical · Health
Flags servers with JETSTREAM-type healthz errors.
Remediation. JetStream subsystem unhealthy: no contact with the meta leader, not current, or still recovering. Check cluster connectivity and meta group status via nats server report jetstream.
SERVER_015 — Stream Recovery Failure
critical · Consistency
Flags servers with STREAM or CONSUMER-type healthz errors.
Remediation. JetStream stream or consumer could not be recovered from disk. Check for corrupt store files. Use nats stream report to identify affected streams.
SERVER_016 — Account Resolution Failure
warning · Consistency
Flags servers with ACCOUNT-type healthz errors.
Remediation. JetStream account could not be resolved. Verify account JWT configuration and resolver connectivity.
SERVER_017 — JetStream Storage Pressure
warning · Saturation
Flags servers where JetStream storage usage is at or above 90% of reserved.
Remediation. Add retention limits (max_age, max_bytes) to streams consuming the most space, purge stale data, increase max_store reservation, or add disk capacity.
SERVER_018 — High Gateway RTT
warning · Performance
Flags gateway connections with round-trip time exceeding the threshold.
Remediation. Investigate network latency between clusters. Gateways are inter-cluster. High RTT may be expected for cross-region deployments. Consider stream placement closer to consumers to reduce gateway dependency.
SERVER_019 — JetStream Storage vs Configured Limit
warning · Saturation
Flags servers where JetStream storage usage approaches the configured max_store limit; exceeding that limit causes Raft failures.
Remediation. Actual JetStream storage usage is approaching the configured max_store limit. Unlike js_reserved_storage (sum of stream reservations), max_store is the hard filesystem limit. When usage exceeds max_store, Raft WAL writes fail and streams become unavailable. Increase max_store in the server config, add disk capacity, or reduce stream storage by purging data or adding retention limits.
Cluster
CLUSTER_001 — Memory Usage Outlier
warning · Saturation
Flags servers whose memory usage exceeds the configured multiplier of their cluster average.
Remediation. Investigate what is consuming memory on the outlier server. Common causes include large Raft state, many subscriptions, or memory-backed streams. Consider rebalancing workload.
CLUSTER_003 — High HA Assets
warning · Saturation
Flags servers with 1000 or more highly-available JetStream assets.
Remediation. Distribute streams and consumers across more servers. Consider reducing replica counts on low-priority streams or consolidating small streams.
CLUSTER_004 — Cluster Name Whitespace
warning · Consistency
Flags servers whose cluster name contains whitespace characters.
Remediation. Remove whitespace from the cluster name in the server configuration and restart the affected servers.
CLUSTER_005 — Route Count Low
warning · Health
Flags servers with fewer cluster routes than expected based on cluster size.
Remediation. Verify network connectivity between cluster peers on the cluster port. Ensure all servers have matching cluster names and correct route URLs (full mesh requires each server to list at least one other). Check firewall rules and DNS resolution for route hostnames. Expected route count is N-1 for a full-mesh cluster of N servers.
CLUSTER_006 — Connection Count Change
warning · Errors
Flags servers where the connection count changed dramatically between epochs, indicating a significant increase or decrease in connected clients.
Remediation. A dramatic connection count change was detected. Investigate in order: misconfigured client reconnect backoff (default is randomized jitter, not zero), authentication failures causing immediate disconnect after connect, slow consumer evictions triggering reconnect loops, load balancer health checks creating ephemeral connections, or network instability. Review client connection URLs and reconnect settings.
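As a reference point, a hedged nats.go sketch of reconnect settings that avoid tight reconnect loops; the URLs and durations are illustrative.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://n1:4222,nats://n2:4222,nats://n3:4222",
		nats.MaxReconnects(-1),            // retry indefinitely instead of giving up
		nats.ReconnectWait(2*time.Second), // base delay between attempts
		nats.ReconnectJitter(500*time.Millisecond, 2*time.Second), // jitter (plain, TLS)
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected: %v", err)
		}),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("reconnected to %s", c.ConnectedUrl())
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```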
CLUSTER_007 — Gateway Disconnection
critical · Health
Flags servers that lost a gateway connection since the previous epoch.
Remediation. Gateway connection to a remote cluster was lost. Gateways auto-reconnect with randomized jitter, so transient network issues resolve automatically. If the disconnection persists: check TLS certificate validity (including OCSP stapling if enabled; stale OCSP responses are a common cause), verify firewall rules between clusters allow the gateway port, and confirm gateway names are consistent across all clusters.
CLUSTER_008 — Gateway Config Mismatch
warning · Consistency
Flags servers whose set of gateway connections differs from the cluster majority.
Remediation. Ensure all servers in the cluster have identical gateway configuration. Asymmetric gateways cause routing failures. If reject_unknown_cluster is enabled, unlisted gateways are rejected. When disabled (default), gateways can be discovered implicitly via gossip. Verify all intended gateways are listed and TLS configuration matches on both sides.
Account
ACCOUNTS_001 — Account Connection Limit
warning · Saturation
Flags accounts where connections are at or above 90% of the configured limit.
Remediation. Increase the account connection limit in the JWT or server config. Distribute clients across more servers or use separate credentials per service instance.
ACCOUNTS_002 — Slow Consumers
critical · Errors
Flags accounts with new slow consumer events since the previous epoch, aggregated across servers.
Remediation. Identify the affected consumers within this account and increase their processing throughput. Consider adding more consumer instances or reducing message rates.
ACCOUNTS_003 — Inactive JWT Import
critical · Consistency
Detects imports declared in the account JWT but not activated by the server, and diagnoses the root cause: missing activation token, expired token, token signed by a rotated signing key, or source export not found.
Remediation. The JWT import is declared but not activated by the server. Follow the action indicated by the diagnostic detail column. 'Missing activation token': request an activation token from the exporting account's operator. 'Expired activation token': renew the token with the exporting account. 'Activation token signed by rotated signing key': re-issue the token with the current signing key. 'Source export not found': verify the export exists in the exporting account JWT.
ACCOUNTS_004 — Orphaned Export
warning · Consistency
Flags exports with no matching importer in any account. Uses NATS wildcard subject matching.
Remediation. Remove the unused export from the account JWT, or create the intended import in the consuming account.
ACCOUNTS_005 — No Subscription Interest
info · Consistency
Finds active imports where no client in the importing account subscribes to the imported subject. Uses NATS wildcard subject matching.
Remediation. Verify clients are subscribing to the correct subject. Remove the import if it is no longer needed.
ACCOUNTS_006 — Account Subscription Limit
warning · Saturation
Flags accounts where subscriptions are at or above 90% of the configured limit.
Remediation. Increase the account subscription limit in the JWT or server config. Consolidate clients that subscribe to overlapping subjects or remove unused subscriptions.
JetStream
JetStream checks apply to streams, KV stores, and object stores. Each stream-scoped check result carries an entity_type value (stream, kvstore, or objectstore) indicating which kind of asset the finding relates to.
JETSTREAM_001 — Stream Replica Lag
warning · Consistency
Flags stream replicas whose last sequence number is more than 10% behind the leader.
Remediation. Check for resource contention (CPU, disk I/O, network bandwidth) on the lagging replica's server. Replicas catch up automatically through Raft replication. If auto-catchup stalls, use nats stream cluster peer-remove followed by re-adding to force a full snapshot sync, but only as a last resort.
JETSTREAM_002 — High Subject Cardinality
warning · Saturation
Flags streams with one million or more unique subjects.
Remediation. Review the subject naming scheme. Consider partitioning high-cardinality data across multiple streams or using a flatter subject hierarchy.
JETSTREAM_003 — Stream Message Limit
warning · Saturation
Flags streams where message count is at or above 90% of the limit.
Remediation. Add a retention policy (max_age) to expire old messages, increase max_msgs, or set max_msgs_per_subject to distribute limits across subjects.
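A sketch of adding retention limits with the nats.go jetstream API; the EVENTS stream name, URL, and values are assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Fetch the current config, add retention limits, and push the update.
	stream, err := js.Stream(ctx, "EVENTS")
	if err != nil {
		log.Fatal(err)
	}
	cfg := stream.CachedInfo().Config
	cfg.MaxAge = 7 * 24 * time.Hour // expire messages older than a week
	cfg.MaxMsgsPerSubject = 10_000  // spread the message budget across subjects
	if _, err := js.UpdateStream(ctx, cfg); err != nil {
		log.Fatal(err)
	}
}
```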
JETSTREAM_004 — JS API Request Rate High
warning · Performance
Flags when the JetStream API request rate exceeds the threshold.
Remediation. Identify the source of API requests. Common causes: rapid stream/consumer creation, excessive info lookups, or tight-loop API calls. Reduce concurrent API call volume. The server queues requests internally and publishes an advisory when the queue saturates. Cache JetStream info responses client-side where possible.
JETSTREAM_005 — JS API Pending High
warning · Performance
Flags servers where JetStream API inflight requests exceed the threshold.
Remediation. Reduce the rate of concurrent JetStream API calls. Check for clients making synchronous API calls in tight loops.
JETSTREAM_006 — Consumer Count Change
warning · Errors
Flags when the total consumer count change between epochs exceeds the threshold, indicating a significant increase or decrease.
Remediation. Identify what is creating and destroying consumers rapidly. Use durable consumers to avoid recreation. For ephemeral consumers, set inactive_threshold (default 5s) to a longer duration if consumers are being deleted prematurely due to brief inactivity gaps.
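A sketch of lengthening inactive_threshold on an ephemeral consumer, assuming a connected jetstream.JetStream handle as in the JETSTREAM_003 sketch; the stream name and duration are illustrative.

```go
package snippets

import (
	"context"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// tolerantEphemeral creates an ephemeral consumer (no Durable name) that
// survives brief inactivity gaps instead of being deleted after the 5s default.
func tolerantEphemeral(ctx context.Context, js jetstream.JetStream) (jetstream.Consumer, error) {
	return js.CreateOrUpdateConsumer(ctx, "EVENTS", jetstream.ConsumerConfig{
		AckPolicy:         jetstream.AckExplicitPolicy,
		InactiveThreshold: 10 * time.Minute, // keep the consumer through idle periods
	})
}
```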
JETSTREAM_007 — JetStream Memory Utilization Critical
critical · Saturation
Flags servers where JetStream memory usage exceeds the critical threshold.
Remediation. Immediately reduce memory-backed stream usage. Convert streams to file-backed storage or reduce max_mem. When memory is exhausted, new stream writes are rejected.
JETSTREAM_008 — Stream Quorum Lost
critical · Health
Flags replicated streams where enough replicas are offline to lose quorum.
Remediation. Restore offline replicas by bringing their servers back online. If servers are permanently lost, remove failed peers via nats stream cluster peer-remove to lower the quorum requirement so the remaining peers can elect a leader.
JETSTREAM_009 — JS API Error Rate High
warning · Errors
Flags servers where JetStream API errors exceed a percentage of total requests.
Remediation. JetStream API error rate exceeded the threshold on this server. Check server logs for specific API error categories. Common types include permission denials (403), stream/consumer not found (404), and resource exhaustion (503). High error rates often correlate with client misconfiguration (wrong stream names, insufficient permissions) rather than server issues.
JETSTREAM_010 — Stream Byte Limit
warning · Saturation
Flags streams where byte usage is at or above 90% of the limit.
Remediation. Purge stale data, increase max_bytes, enable S2 compression to reduce on-disk size, or add max_age to expire old messages.
JETSTREAM_011 — Stream Consumer Limit
warning · Saturation
Flags streams where consumer count is at or above 90% of the limit.
Remediation. Remove unused or inactive consumers, increase max_consumers, or consolidate consumers that read overlapping subject filters.
JETSTREAM_012 — JetStream Storage Utilization Critical
critical · Saturation
Flags servers where JetStream storage usage exceeds the critical threshold.
Remediation. Immediately free storage by purging or deleting low-priority streams. When storage is exhausted, stream writes fail with I/O errors. Increase max_store or add disk capacity. Set max_bytes on all streams.
JETSTREAM_013 — Stream Subject/Message Count Inconsistency
warning · Consistency
Flags streams where the number of unique subjects exceeds the total message count, an invariant violation indicating filestore corruption.
Remediation. Stream reports more unique subjects than total messages. This is an invariant violation indicating filestore accounting corruption. This condition persists across server restarts. Contact support with the stream details for guidance on recovery. Backing up and recreating the stream may be necessary.
JETSTREAM_014 — Stream Replica Message Count Divergence
critical · Consistency
Flags replicated streams where all replicas report current but have significantly different message counts, indicating filestore corruption or raft state reset.
Remediation. Stream replicas report divergent message counts despite all being current in Raft. This indicates filestore corruption, raft state reset, or interest-based retention sync failure. Compare replica states with nats stream info --all. The replica with the lowest count likely lost data. For interest-based retention streams, consumer ack propagation may have failed. Check consumer states across replicas.
JETSTREAM_015 — Mirror Last Seen Staleness
warning · Consistency
Flags mirror streams where the mirror consumer has stalled: zero lag but no activity while the source stream continues receiving messages.
Remediation. Mirror stream shows zero lag but hasn't received activity in over 5 minutes while the source stream continues receiving messages. The internal mirror consumer has likely stalled. Perform a leader step-down on the mirror stream to force recreation of the mirror consumer: nats stream cluster step-down STREAM_NAME.
JETSTREAM_016 — JetStream Storage vs Configured Limit Critical
critical · Saturation
Flags servers where JetStream storage usage critically exceeds the configured max_store limit, risking imminent Raft failures.
Remediation. JetStream storage usage has reached critical levels relative to the configured max_store limit. Raft WAL writes will fail when storage exceeds this limit, causing streams to lose quorum and become unavailable. Immediately free storage by purging low-priority streams, increase max_store, or add disk capacity.
JETSTREAM_017 — Mirror Lag Critical
critical · Consistency
Flags mirror streams where mirror lag exceeds the operator-defined io.nats.monitor.lag-critical threshold.
Remediation. The mirror stream is falling behind the source by more than the operator threshold. Check network connectivity to the mirror source and resource contention on the mirror server.
JETSTREAM_018 — Mirror Seen Critical
critical · Consistency
Flags mirror streams where the time since the mirror was last active exceeds the operator-defined io.nats.monitor.seen-critical threshold.
Remediation. The mirror stream has not received data from the source within the operator-defined window. Verify the source stream is active and network connectivity is healthy.
JETSTREAM_019 — Min Sources
critical · Health
Flags streams where the source count is below the operator-defined io.nats.monitor.min-sources threshold.
Remediation. The stream has fewer sources than the operator-defined minimum. Verify that all expected source streams exist and are configured correctly.
JETSTREAM_020 — Max Sources
critical · Health
Flags streams where the source count exceeds the operator-defined io.nats.monitor.max-sources threshold.
Remediation. The stream has more sources than the operator-defined maximum. Remove unexpected sources or update the threshold.
JETSTREAM_021 — Peer Expect
critical · Health
Flags streams where the actual peer count does not match the operator-defined io.nats.monitor.peer-expect threshold.
Remediation. The stream's actual peer count does not match the operator expectation. Check for offline replicas or verify the num_replicas configuration.
JETSTREAM_022 — Peer Lag Critical
critical · Consistency
Flags stream replicas where lag exceeds the operator-defined io.nats.monitor.peer-lag-critical threshold.
Remediation. A stream replica is lagging behind the leader by more than the operator threshold. Check for resource contention on the replica server.
JETSTREAM_023 — Peer Seen Critical
critical · Consistency
Flags stream replicas where the time since the replica was last active exceeds the operator-defined io.nats.monitor.peer-seen-critical threshold.
Remediation. A stream replica has not been active within the operator-defined window. The replica may be offline or experiencing network issues.
JETSTREAM_024 — Message Count Threshold
warning/critical · Saturation
Flags streams where message count crosses an operator-defined threshold. Direction is inferred from threshold ordering.
Remediation. The stream message count has crossed the operator-defined threshold. If too many: add retention policies or increase limits. If too few: investigate upstream publishers.
JETSTREAM_025 — Subject Count Threshold
warning/critical · Saturation
Flags streams where subject count crosses an operator-defined threshold. Direction is inferred from threshold ordering.
Remediation. The stream subject count has crossed the operator-defined threshold. Review the subject naming scheme and consider partitioning high-cardinality data.
Meta Cluster
META_001 — Offline Replica
critical · Health
Flags meta cluster replicas that are reported as offline.
Remediation. Bring the offline server back online. Check server logs and network connectivity. If the server is permanently lost, remove it via nats server cluster peer-remove. This requires the remaining members to have quorum.
META_002 — Leader Disagreement
critical · Health
Flags when multiple servers report themselves as the meta cluster leader.
Remediation. This indicates a transient state during a leader election, not a split-brain (Raft quorum rules prevent split-brain). Multiple servers may briefly report leadership during term transitions. If persistent, check network connectivity between cluster peers and ensure no asymmetric partition exists. Restart the lower-term leader to force convergence.
META_003 — Meta Leader Flapping
warning · Errors
Flags when the meta cluster leader has changed more than the allowed number of times in the recent time window.
Remediation. Meta cluster leader changed more than the allowed number of times in the time window. Raft heartbeats are sent every 1 second and the election timeout is 4-9 seconds. Any disruption longer than 4 seconds triggers a new election. Investigate: network instability between meta cluster peers (even brief packet loss can trigger elections), CPU saturation delaying heartbeat processing, or disk I/O stalls blocking Raft WAL writes. Use nats server report jetstream to see current meta state.
META_004 — Meta Snapshot Slow
warning/critical · Performance
Flags when the meta cluster snapshot duration exceeds the warning or critical threshold.
Remediation. Reduce the number of JetStream assets (streams, consumers) to shrink snapshot size. Check disk I/O performance on meta cluster servers.
META_005 — Meta State Growth
warning · Saturation
Flags when the total number of JetStream asset replicas exceeds the threshold.
Remediation. Remove unused streams and consumers. Consider reducing replica counts or consolidating small streams to reduce the total Raft group count.
META_006 — Meta Quorum Lost
critical · Health
Flags when enough meta cluster peers are offline to lose quorum.
Remediation. Immediately restore offline meta cluster servers. Without quorum, all JetStream API operations are stalled cluster-wide. If servers are permanently lost, stop all remaining meta servers, remove the failed peer's Raft WAL state, and restart to re-bootstrap the meta group. Quorum loss detection occurs within 10 seconds.
META_007 — Even Cluster Size
warning · Consistency
Flags when the meta cluster has an even number of peers.
Remediation. Consider adding or removing a server to make the meta cluster an odd size. An even-sized cluster adds quorum cost without adding fault tolerance (quorum = floor(N/2) + 1): 3 nodes tolerate 1 failure, and 4 nodes still tolerate only 1.
META_008 — Meta Pending High
warning · Performance
Flags when the meta cluster leader has a high number of pending Raft operations.
Remediation. High pending operations indicate the meta group is falling behind on consensus. Check server CPU, disk I/O, and network latency on the meta leader. Consider reducing JetStream API request rate.
META_009 — Meta Cluster Size Decreased
critical · Health
Flags when the meta cluster size has decreased between consecutive epochs, indicating a peer was removed or lost.
Remediation. Meta cluster size decreased. A peer was removed or lost. If intentional (planned decommission via nats server cluster peer-remove), verify the remaining cluster has an odd number of peers for optimal quorum. If unintentional, immediately investigate the lost peer. Check server logs, network connectivity, and disk health. A shrinking meta cluster reduces fault tolerance.
Service
SERVICE_001 — Service Version Mismatch
warning · Consistency
Flags services where instances report different client versions or languages.
Remediation. Complete the rolling deployment so all instances run the same version. If intentional, verify compatibility between versions.
SERVICE_002 — Service Down
critical · Health
Flags services that had instances in the previous epoch but zero in the current epoch.
Remediation. Restart the service instances. Check application logs and orchestration platform (Kubernetes, systemd) for failure reasons.
Leafnode
LEAF_001 — Leafnode Name Whitespace
warning · Consistency
Flags leafnode connections whose remote server name contains whitespace.
Remediation. Remove whitespace from the server name in the leafnode's configuration and restart it.
LEAF_002 — High Leaf RTT
warning · Performance
Flags leafnode connections with round-trip time exceeding the threshold.
Remediation. Investigate network latency between the leaf server and the hub. Consider co-locating leaf servers closer to hubs or checking for network congestion.
LEAF_003 — Leafnode Subscription Count High
warning/critical · Saturation
Flags leafnode connections carrying a large number of subscriptions, which can cause hub processing to exceed the stale connection timeout.
Remediation. Leafnode carries a high subscription count which increases hub processing time during connection establishment. If processing exceeds the 2-second stale connection timeout, the connection will be dropped and retried in a loop. Reduce the number of subscriptions propagated across the leafnode. Use explicit exports/imports instead of wildcard subscriptions, or consolidate subscriber applications.
Connection
CONN_001 — High Client RTT
warning · Performance
Flags client connections with round-trip time exceeding 100 ms.
Remediation. Move the client closer to the NATS server or use a leafnode to bridge the distance. Check for network congestion.
CONN_002 — Client Pending Pressure
warning · Performance
Flags client connections with more than 1 MiB of pending bytes.
Remediation. Increase the client's message processing throughput or reduce the publish rate to this subscriber. The default pending buffer is 64 MiB (max_pending in server config). This is an early warning before slow consumer disconnection.
CONN_003 — Connection Stopped
info · Errors
Flags connections that disconnected with a non-empty reason.
Remediation. Connection disconnected with a non-empty stop reason. Review the stop reason in the detail column for the specific cause. Common reasons: 'Slow Consumer - Loss' (client could not keep up with message rate), 'Authentication Failure' (invalid or expired credentials), 'Server Shutdown' (planned maintenance), 'Maximum Connections Exceeded' (server or account limit reached).
Consumer
CONSUMER_001 — Consumer Replica Offline
critical · Health
Flags consumer replicas that are reported as offline.
Remediation. Bring the offline server back online. Check server logs for the reason the replica went offline.
CONSUMER_002 — Consumer Replica Lag
warning · Consistency
Flags consumer replicas lagging by more than 1000 operations behind the leader.
Remediation. Check for resource contention on the lagging server. If persistent, consider removing and re-adding the consumer replica.
CONSUMER_003 — Consumer Quorum Lost
critical · Health
Flags replicated consumers where enough replicas are offline to lose quorum.
Remediation. Restore offline replicas by bringing their servers back online. Without quorum the consumer cannot make progress.
CONSUMER_004 — Consumer Delivered Below Stream First Sequence
critical · Consistency
Flags consumers whose last delivered position is below the stream's first sequence after a purge or truncation.
Remediation. The consumer's delivered position references a sequence number that no longer exists in the stream, typically after a stream purge or truncation. The consumer appears healthy but silently misses all new messages. Delete and recreate the consumer, or use nats consumer edit to reset its deliver policy to 'all' or 'last'.
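A sketch of the delete-and-recreate path using the nats.go jetstream API; the stream and consumer names are assumptions, and it assumes a connected jetstream.JetStream handle.

```go
package snippets

import (
	"context"

	"github.com/nats-io/nats.go/jetstream"
)

// resetConsumer recreates a consumer whose delivered position fell below the
// stream's first sequence, restarting from the first available message.
func resetConsumer(ctx context.Context, js jetstream.JetStream) error {
	if err := js.DeleteConsumer(ctx, "ORDERS", "worker"); err != nil {
		return err
	}
	_, err := js.CreateOrUpdateConsumer(ctx, "ORDERS", jetstream.ConsumerConfig{
		Durable:       "worker",
		AckPolicy:     jetstream.AckExplicitPolicy,
		DeliverPolicy: jetstream.DeliverAllPolicy, // begin at the stream's current first sequence
	})
	return err
}
```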
CONSUMER_005 — Consumer Sequence Ahead of Stream Sequence
critical · Consistency
Flags consumers whose delivered position is ahead of the stream's last sequence.
Remediation. The consumer's delivered position is ahead of the stream's last sequence. The consumer is waiting for messages the stream hasn't produced yet. This can happen after stream movement across clusters, leadership transfers with data loss, or raft state resets. Delete and recreate the consumer to reset its position.
CONSUMER_006 — Outstanding Ack Critical
critical · Health
Flags consumers where num_ack_pending exceeds the operator-defined threshold.
Remediation. The consumer has more outstanding acks than the operator threshold. Increase consumer throughput, scale consumers, or raise the threshold in stream/consumer metadata.
CONSUMER_007 — Waiting Critical
critical · Health
Flags consumers where num_waiting exceeds the operator-defined threshold.
Remediation. The consumer has more waiting pull requests than the operator threshold. Add consumer instances or increase max_waiting to accommodate the load.
CONSUMER_008 — Unprocessed Critical
critical · Health
Flags consumers where num_pending exceeds the operator-defined threshold.
Remediation. The consumer has more unprocessed messages than the operator threshold. Scale consumer processing capacity or investigate consumer stalls.
CONSUMER_009 — Last Delivery Critical
critical · Health
Flags consumers where the time since the last delivery exceeds the operator-defined threshold.
Remediation. The consumer has not delivered a message within the operator-defined window. Check if the consumer is stalled, paused, or if the stream has stopped receiving messages.
CONSUMER_010 — Last Ack Critical
critical · Health
Flags consumers where the time since the last acknowledgment exceeds the operator-defined threshold.
Remediation. The consumer has not acknowledged a message within the operator-defined window. Check if downstream processing is stalled or if the consumer application is healthy.
CONSUMER_011 — Redelivery Critical
critical · Errors
Flags consumers where num_redelivered exceeds the operator-defined threshold.
Remediation. The consumer is redelivering more messages than the operator threshold. Investigate processing failures, increase ack_wait, or fix downstream errors causing nacks.
CONSUMER_012 — Pinned Consumer Policy Mismatch
critical · Consistency
Flags consumers with io.nats.monitor.pinned metadata that are not using the overflow priority policy.
Remediation. The consumer metadata indicates it should be pinned but the priority_policy is not set to overflow. Update the consumer configuration to use priority_policy=overflow.
User
USER_001 — Bearer Token User
warning · Errors
Flags bearer token users with active connections.
Remediation. Migrate the user to NKey-based authentication where possible. Bearer tokens skip nonce signature verification during CONNECT, relying on JWT validity alone. They are appropriate for WebSocket and HTTP contexts where NKey signing is impractical, but should not be used for long-lived server-to-server connections.
USER_002 — Excessive User Connections
warning · Errors
Flags users with more than 100 active connections.
Remediation. Investigate why a single user has so many connections. Consider using connection pooling or separate user credentials per service instance.
Change
CHANGE_001 — Config Reload Detected
info · Change
Detects servers whose configuration was reloaded by comparing config_load_time between consecutive epochs.
Remediation. Verify the configuration change was intentional. Review server logs to confirm the reload was successful and no errors occurred.
CHANGE_002 — JetStream Domain Changed
warning · Change
Detects servers whose JetStream domain value changed between consecutive epochs.
Remediation. Verify the domain change was intentional. JetStream domain changes can affect stream and consumer routing across clusters.
CHANGE_003 — Account Added or Removed
info · Change
Detects accounts that appeared or disappeared between consecutive epochs.
Remediation. Verify the account change was expected. For new accounts, ensure imports and exports are correctly wired. For removed accounts, confirm no dependent services remain.
CHANGE_004 — Stream Configuration Changed
info · Change
Detects streams whose configuration fields (replicas, retention, limits) changed between consecutive epochs.
Remediation. Verify the stream configuration change was intentional and monitor for downstream effects on consumers.
Optimization Checks
Placement
OPT_PLACE_001 — Cross-Cluster Stream Access
info · Performance
Flags accounts with clients in clusters that have no local stream leaders.
Remediation. Place stream replicas in clusters where clients connect, or migrate clients to clusters with existing stream leaders to reduce gateway traffic.
OPT_PLACE_002 — Consumer Leader Not Co-located
info · Performance
Flags consumers whose leader is in a different cluster than the majority of connections.
Remediation. Use preferred placement tags to co-locate consumer leaders with the majority of subscribing clients. To force a leader election, use nats consumer cluster step-down, which may relocate the leader to a better-positioned replica.
OPT_PLACE_003 — High Gateway Traffic Ratio
info · Performance
Flags accounts where more than 30% of traffic is cross-cluster gateway traffic.
Remediation. Review stream and consumer placement for this account. Move workloads closer to data to reduce inter-cluster traffic.
OPT_PLACE_004 — Gateway Interest Mode
info · Performance
Flags gateway account combinations still using optimistic interest mode.
Remediation. Optimistic mode floods all messages to remote clusters until interest is learned. The server auto-transitions to interest-only mode after a subscription activity threshold is reached. If this account is stuck in optimistic mode, verify the gateway is running NATS 2.9+ (where interest-only is the default) or check for high subscription churn preventing the transition.
Cost
OPT_COST_001 — Over-Replicated Inactive Stream
info · Consistency
Flags R3+ streams with no new messages across the selected time range.
Remediation. Reduce the replica count to R1 for inactive streams, or delete the stream if it is no longer needed.
OPT_COST_002 — Memory Storage Large Stream
info · Saturation
Flags memory-backed streams using more than 100 MiB.
Remediation. Convert the stream to file-backed storage if low-latency access is not required. Memory-backed streams consume server RAM directly.
OPT_COST_003 — Wasted JetStream Memory Reservation
info · Consistency
Flags servers where JetStream memory usage is below 20% of reserved capacity.
Remediation. Reduce the JetStream memory reservation to match actual usage, or migrate memory-backed streams to this server to improve utilization.
OPT_COST_004 — Uncompressed Large Stream
info · Consistency
Flags file-backed streams exceeding 1 GiB with no compression enabled.
Remediation. Enable S2 compression on the stream configuration to reduce disk usage and I/O costs.
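A sketch of enabling compression on an existing stream via the nats.go jetstream API, assuming a connected handle; the ARCHIVE name is illustrative.

```go
package snippets

import (
	"context"

	"github.com/nats-io/nats.go/jetstream"
)

// enableCompression switches a file-backed stream to S2 block compression.
func enableCompression(ctx context.Context, js jetstream.JetStream) error {
	stream, err := js.Stream(ctx, "ARCHIVE")
	if err != nil {
		return err
	}
	cfg := stream.CachedInfo().Config
	cfg.Compression = jetstream.S2Compression // compress message blocks on disk
	_, err = js.UpdateStream(ctx, cfg)
	return err
}
```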
OPT_COST_005 — Wasted JetStream Storage Reservation
info · Consistency
Flags servers where JetStream storage usage is below 20% of reserved capacity.
Remediation. Reduce the JetStream storage reservation to match actual usage, or migrate file-backed streams to this server to improve utilization.
Balance
OPT_BALANCE_001 — Uneven Leader Distribution
info · Saturation
Flags servers hosting disproportionately many stream and consumer leaders.
Remediation. Use nats stream cluster step-down and nats consumer cluster step-down to redistribute leaders across the cluster. Target servers with the highest leader counts first.
OPT_BALANCE_002 — Connection Hotspot
info · Saturation
Flags servers with more than double the cluster average connections.
Remediation. Review client connection configuration. Use DNS round-robin or load balancer to distribute connections more evenly across cluster servers.
OPT_BALANCE_003 — Subscription Hotspot
info · Saturation
Flags servers with more than double the cluster average subscriptions.
Remediation. Redistribute client connections to balance subscription load. Check for clients with excessive subscriptions.
OPT_BALANCE_004 — Stream Replica Count Imbalance
info · Saturation
Flags servers hosting disproportionately many stream replicas.
Remediation. Use placement tags to distribute new streams more evenly. Consider removing and re-adding replicas to rebalance.
OPT_BALANCE_005 — JetStream Storage Skew
info · Saturation
Flags servers whose JetStream storage exceeds double the cluster average.
Remediation. Migrate large streams to other cluster servers or add storage capacity to balance disk usage.
OPT_BALANCE_006 — Account Connection Concentration
info · Saturation
Flags servers hosting more than 70% of an account's connections.
Remediation. Configure client connection URLs to include multiple servers. Use a load balancer to spread connections across the cluster.
OPT_BALANCE_007 — Stream-Consumer Leader Co-location
info · Saturation
Flags streams where the stream leader's server also hosts a disproportionate share of consumer leaders.
Remediation. The stream leader's server hosts more than half of the consumer leaders for this stream, concentrating I/O and CPU load on a single node. Use nats consumer cluster step-down to redistribute consumer leaders across the cluster.
OPT_BALANCE_008 — JetStream Storage Saturation with Skew
warning · Saturation
Flags servers with high JetStream storage utilization where the cluster also exhibits significant storage skew between nodes.
Remediation. Server is near JetStream storage capacity and the cluster has significant storage imbalance between nodes. Migrate streams from the saturated server to underutilized peers, or increase storage on the saturated server. Use placement tags to guide future stream placement.
Account
OPT_ACCT_001 — Account Storage Quota Approaching Limit
warning · Saturation
Flags accounts where JetStream storage reservations approach the configured quota.
Remediation. Account's JetStream storage reservations are approaching the configured quota. When the quota is reached, all new stream creates and stream writes for this account will fail. Reduce stream max_bytes reservations, delete unused streams, or increase the account's js_disk_storage limit. Note: NATS enforces quotas by reservation (max_bytes x num_replicas), not actual bytes used.
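For example, an R3 stream with max_bytes of 10 GiB counts as 30 GiB against the account quota, regardless of how many bytes it actually stores.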
OPT_ACCT_002 — Excessive JWT Size
warning · Consistency
Flags accounts with unusually large JWT claims, indicating excessive permissions or revocations.
Remediation. Account JWT is unusually large, likely due to excessive permissions, many revocations, or a large number of signing keys. Large JWTs increase memory usage and slow account resolution on every connection. Review the account's permissions and revocations. Consolidate wildcard permissions where possible, and prune expired revocations.
Idle Resources
OPT_IDLE_001 — Underutilized Server
info · Health
Flags servers that remained nearly idle across the selected time range.
Remediation. Consider decommissioning the server or migrating workload to it from busier servers.
OPT_IDLE_002 — Inactive Stream
info · Health
Flags unsealed streams that received no new messages across the time range.
Remediation. Delete the stream if it is no longer needed, or seal it to prevent accidental writes. If temporarily inactive, no action is needed.
OPT_IDLE_003 — Inactive Consumer
info · Consistency
Flags consumers that made no delivery progress across the time range.
Remediation. Delete the consumer if it is no longer processing messages. Check whether the subscribing application is running.
OPT_IDLE_004 — Drained Consumer
info · Consistency
Flags consumers fully caught up with zero pending on an inactive stream.
Remediation. Consider deleting the consumer since its stream has no new messages and all existing messages have been processed.
OPT_IDLE_005 — Inactive Account
info · Health
Flags non-system accounts with no connections or throughput for the configured inactivity threshold (default 24h).
Remediation. Review whether the account is still needed. Remove or disable it if no longer in use.
OPT_IDLE_006 — Disconnected Users
info · Consistency
Flags non-system account users with no active connections at the current epoch.
Remediation. Verify whether the user credential is still in use. Revoke the user if no longer needed.
OPT_IDLE_007 — Idle Client Connections
info · Consistency
Flags client connections idle for more than 5 minutes with zero messages.
Remediation. Client connection has been idle with zero messages for longer than the threshold. Check the subscription count: if it is zero, the connection is likely leaked (connected but never subscribed). If subscriptions exist, check the client library name and version; it may be a monitoring or health-check client that connects but does not publish or subscribe to active subjects. Close leaked connections to free server resources.
System Improvement
OPT_SYS_001 — Streams Without Limits
info · Consistency
Flags streams with no message, byte, or age retention limits.
Remediation. Configure at least one retention limit (max_msgs, max_bytes, max_age, or max_msgs_per_subject) to prevent unbounded disk growth.
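A sketch of creating a stream with explicit limits using the nats.go jetstream API, assuming a connected handle; names and values are assumptions.

```go
package snippets

import (
	"context"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// createBoundedStream sets both an age and a byte cap so disk usage stays
// bounded even if publish rates spike.
func createBoundedStream(ctx context.Context, js jetstream.JetStream) error {
	_, err := js.CreateStream(ctx, jetstream.StreamConfig{
		Name:     "TELEMETRY",
		Subjects: []string{"telemetry.>"},
		MaxAge:   72 * time.Hour,          // expire old data automatically
		MaxBytes: 50 * 1024 * 1024 * 1024, // hard cap on disk usage
	})
	return err
}
```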
OPT_SYS_002 — High Consumer Redelivery
warning · Errors
Flags consumers with a redelivery rate exceeding 10%.
Remediation. Redelivery rate exceeded the threshold. Messages are being delivered multiple times to this consumer. Common causes: processing time exceeds ack_wait (default 30s), application panics before acknowledging, or incorrect ack logic (acking the wrong message). Set max_deliver to cap retry attempts and prevent infinite redelivery loops. Configure backoff for exponential retry spacing. Increase ack_wait if processing legitimately takes longer.
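A sketch of the suggested consumer settings in the nats.go jetstream API, assuming a connected handle; stream and consumer names and durations are assumptions.

```go
package snippets

import (
	"context"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// tuneRedelivery caps retry attempts and spaces redeliveries out.
func tuneRedelivery(ctx context.Context, js jetstream.JetStream) error {
	_, err := js.CreateOrUpdateConsumer(ctx, "EVENTS", jetstream.ConsumerConfig{
		Durable:    "processor",
		AckPolicy:  jetstream.AckExplicitPolicy,
		MaxDeliver: 5, // cap retry attempts; must exceed the backoff list length
		// Redelivery spacing; the first entry acts as the initial ack wait.
		BackOff: []time.Duration{
			10 * time.Second, 30 * time.Second, 2 * time.Minute, 5 * time.Minute,
		},
	})
	return err
}
```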
OPT_SYS_003 — Ack Pending Buildup
warning · Errors
Flags consumers approaching their maximum ack pending limit.
Remediation. Scale out consumer instances to process messages faster, increase max_ack_pending (default 1,000), or investigate why messages are not being acknowledged. Ack pending can also be limited at the stream level via consumer limits and at the account level.
OPT_SYS_004 — Unbound Push Consumer
warning · Errors
Flags push consumers with no subscriber currently bound.
Remediation. Start the subscribing application or convert to a pull consumer. While unbound, messages are delivered to the deliver subject with no receiver. They accumulate in ack pending and trigger redeliveries until max_deliver is reached.
OPT_SYS_005 — Route Pending Pressure
warning · Performance
Flags route connections with more than 1 MiB of pending data.
Remediation. Investigate network bandwidth between cluster peers. Reduce message rates on high-volume intra-cluster subjects or upgrade network capacity between peers.
OPT_SYS_006 — Leaf Compression Disabled
info · Consistency
Flags leaf connections with compression disabled.
Remediation. Enable S2 compression in the leafnode configuration (compression: s2_auto for adaptive compression based on RTT). Available modes: s2_fast, s2_better, s2_auto (default when enabled). Configure on both hub and leaf sides.
OPT_SYS_007 — Raft Apply Lag
warning · Performance
Flags Raft groups where committed-applied gap exceeds 100 entries.
Remediation. Check disk I/O and CPU on the affected server. The apply lag indicates the server is falling behind in processing committed Raft entries.
OPT_SYS_008 — Unlimited JetStream Account
info · Consistency
Flags non-system accounts with JetStream enabled but no storage limits.
Remediation. Set JetStream memory and disk storage limits in the account JWT to prevent a single account from exhausting cluster resources.
OPT_SYS_009 — Leaderless Raft Group
critical · Health
Flags Raft groups with no elected leader, which cannot process writes.
Remediation. Investigate cluster connectivity. A leaderless group is detected within 10 seconds of quorum loss. Ensure a quorum of peers is online and reachable. Check server logs for election failures or network partition indicators.
OPT_SYS_010 — Raft IPQ Backpressure
warning · Performance
Flags Raft groups whose internal queue lengths exceed the threshold, indicating a processing backlog.
Remediation. High IPQ lengths indicate Raft internal queues (proposals, append entries, apply, responses) are backing up. The apply queue is the most critical: a backlog there means the upper layer (JetStream) cannot consume committed entries fast enough. Check server CPU, disk I/O, and network latency.
OPT_SYS_011 — Subscription Fanout Anomaly
info · Consistency
Flags servers where max fanout is disproportionately higher than average fanout.
Remediation. Investigate subjects with high subscriber counts. A large max-to-average fanout ratio indicates one or more subjects with excessive subscribers, which can create hot spots.
OPT_SYS_012 — Subscription Churn
info · Errors
Flags servers with excessive subscription insert and remove operations since the previous epoch.
Remediation. Excessive subscription insert and remove operations detected. Two diagnostic paths: (1) If a single client is responsible, it is likely a misbehaving application that subscribes/unsubscribes in a loop. Identify it via connection name or IP and fix the client code. (2) If many clients are responsible, it is likely a reconnection storm. Clients reconnecting simultaneously re-subscribe all at once. Check for a preceding network event or server restart that triggered mass reconnection.
OPT_SYS_013 — Raft Sustained Catching Up
warning · Health
Flags Raft groups with a member in catching-up state.
Remediation. Check disk I/O, network bandwidth, and CPU on the catching-up server. If the server is persistently behind, it may need more resources or a re-sync.
OPT_SYS_014 — Gateway Pending Pressure
warning · Performance
Flags gateway connections with more than 1 MiB of pending data.
Remediation. Investigate network bandwidth between clusters. Reduce inter-cluster message rates by improving stream/consumer placement, or upgrade inter-cluster network capacity.
OPT_SYS_015 — Consumer ACK Floor Divergence
warning/critical · Errors
Flags consumers where the gap between delivered position and ACK floor is disproportionately large relative to max_ack_pending, indicating interleaved acknowledgments.
Remediation. Consumer's ACK floor is far behind its delivered position. This indicates interleaved acknowledgments where messages between the ACK floor and delivered position are tracked individually in memory. Causes include out-of-order processing, selective acking, or slow processing of specific messages. Consider using AckAll policy if ordering permits, or investigate why specific messages are not being acknowledged.
OPT_SYS_016 — Direct Gets Disabled
info · Performance
Flags streams with allow_direct disabled, forcing read operations through the Raft consensus pipeline.
Remediation. Stream has allow_direct disabled, forcing all read operations through the Raft consensus pipeline. This adds unnecessary latency and contention with writes. Enable allow_direct unless strong read-after-write consistency is required (e.g., financial transactions). Most workloads benefit from direct reads.
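A sketch of enabling direct gets on an existing stream (nats.go jetstream API, assuming a connected handle; the stream name is illustrative):

```go
package snippets

import (
	"context"

	"github.com/nats-io/nats.go/jetstream"
)

// enableDirectGets lets any replica serve reads, bypassing the Raft pipeline.
func enableDirectGets(ctx context.Context, js jetstream.JetStream) error {
	stream, err := js.Stream(ctx, "PROFILES")
	if err != nil {
		return err
	}
	cfg := stream.CachedInfo().Config
	cfg.AllowDirect = true
	_, err = js.UpdateStream(ctx, cfg)
	return err
}
```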
OPT_SYS_017 — Leafnode Auto Compression with High Count
info · Performance
Flags servers with many leafnode connections using s2_auto compression, which can create a CPU feedback loop under load.
Remediation. Server has a high number of leafnode connections using s2_auto compression. Under load, s2_auto can create a CPU feedback loop: compression increases CPU usage, which increases RTT, which triggers higher compression levels, further increasing CPU. Switch leafnode compression to a fixed level (s2_fast or s2_better) to prevent the feedback loop.
OPT_SYS_018 — High Interior Deletes on Stream
warning · Saturation
Flags streams with a very high number of interior deletes, causing disproportionate memory pressure during recovery and catch-up.
Remediation. Stream has a very high number of interior deletes. The deleted sequence bitmap is held in memory during recovery and replica catch-up, causing disproportionate memory pressure. Consider purging the stream to reset the delete map, or switching to a retention policy that avoids interior deletes.
OPT_SYS_019 — Large Deduplication Window
warning/critical · Saturation
Flags streams with a deduplication window exceeding the threshold and active message flow, risking high memory consumption from the in-memory dedup map.
Remediation. Stream has a deduplication window exceeding the threshold with active message flow. The dedup map holds an in-memory entry (~130-150 bytes) per message published within the window. With UUID-based Nats-Msg-Id headers and high message rates, this can consume gigabytes of memory. Reduce the deduplication window to the minimum required for your publisher retry interval.
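A sketch of shrinking the window with the nats.go jetstream API, assuming a connected handle; the stream name and duration are assumptions. Publishers set the Nats-Msg-Id header via jetstream.WithMsgID when publishing.

```go
package snippets

import (
	"context"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// shrinkDedupWindow bounds the in-memory duplicate-tracking map by lowering
// the window to just above the publisher retry interval.
func shrinkDedupWindow(ctx context.Context, js jetstream.JetStream) error {
	stream, err := js.Stream(ctx, "ORDERS")
	if err != nil {
		return err
	}
	cfg := stream.CachedInfo().Config
	cfg.Duplicates = 30 * time.Second
	_, err = js.UpdateStream(ctx, cfg)
	return err
}
```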
OPT_SYS_020 — KV Buckets Without max_age
info · Saturation
Flags KV buckets with no max_age configured that have accumulated a large number of interior deletes (tombstones).
Remediation. KV bucket has no max_age configured and has accumulated a large number of interior deletes (tombstones from deleted keys). Set max_age to automatically expire old entries and their tombstones. Without it, the delete map grows indefinitely and causes high memory usage during node restart or replica catch-up.
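A sketch of creating a bucket with a TTL via the nats.go jetstream API (the TTL maps to max_age on the backing stream); the bucket name and duration are assumptions.

```go
package snippets

import (
	"context"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// bucketWithTTL creates a KV bucket whose entries (and their tombstones)
// expire automatically instead of accumulating indefinitely.
func bucketWithTTL(ctx context.Context, js jetstream.JetStream) (jetstream.KeyValue, error) {
	return js.CreateKeyValue(ctx, jetstream.KeyValueConfig{
		Bucket: "SESSIONS",
		TTL:    24 * time.Hour,
	})
}
```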
OPT_SYS_021 — R1 Streams in Multi-Node Clusters
info · Health
Flags R1 (single-replica) streams in multi-node clusters that have no redundancy.
Remediation. Stream uses R1 (single replica) in a multi-node cluster. If the hosting node goes down, the stream is completely offline until that node recovers. Consider increasing to R3 for critical data that needs high availability. R1 is appropriate for ephemeral, cacheable, or easily reproducible data.
OPT_SYS_022 — Subscription Count Growth
info · Errors
Flags servers where subscriptions are growing monotonically without a corresponding increase in connections, indicating a subscription leak.
Remediation. Server's subscription count is growing monotonically without a corresponding increase in connections, indicating a subscription leak. Identify the responsible client by examining connection subscription counts, then fix the client application to properly unsubscribe when done.
OPT_SYS_023 — Raft WAL Size Excessive
warning/critical · Saturation
Flags Raft groups with an excessively large write-ahead log, risking disk exhaustion and cascading OOM failures.
Remediation. Raft group WAL has grown excessively large. An unbounded WAL consumes disk and causes cascading failures: disk full -> memory spike (can't flush) -> OOM -> restart -> WAL replay exhausts memory again. Investigate why the WAL is not compacting. Common causes include a stalled follower preventing log truncation, or a raft group with no active consumers to advance the commit index.
OPT_SYS_024 — WorkQueue Discard New with Aggressive Consumer Settings
warning · Consistency
Flags WorkQueue streams using discard_policy=new where consumers have aggressive ack_wait or max_deliver settings, risking message loss.
Remediation. WorkQueue stream with discard_policy: new will reject publishes when the stream is full. If consumers have low max_deliver or short ack_wait, messages may be nacked and discarded before they can be processed, causing silent data loss. Increase ack_wait (recommended >= 30s) and max_deliver (recommended >= 10), or switch the discard policy to old.
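A sketch of the recommended consumer settings for such a stream (nats.go jetstream API, assuming a connected handle; names are assumptions):

```go
package snippets

import (
	"context"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// workqueueConsumer uses conservative ack settings so messages on a
// discard-new WorkQueue stream are not given up on prematurely.
func workqueueConsumer(ctx context.Context, js jetstream.JetStream) (jetstream.Consumer, error) {
	return js.CreateOrUpdateConsumer(ctx, "JOBS", jetstream.ConsumerConfig{
		Durable:    "worker",
		AckPolicy:  jetstream.AckExplicitPolicy,
		AckWait:    60 * time.Second, // comfortably above worst-case processing time
		MaxDeliver: 10,               // enough retries to ride out transient failures
	})
}
```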
OPT_SYS_025 — Sustained Consumer Growth on Stream
warning · Errors
Flags streams where consumer count has been growing steadily, indicating a consumer leak from ephemeral consumers.
Remediation. Stream's consumer count has been growing steadily. This usually indicates ephemeral consumers being created without proper cleanup. Identify the source of consumer creation, set appropriate inactive_threshold on ephemeral consumers, or convert to durable consumers with explicit deletion.
OPT_SYS_026 — Raft Group Peer Count Mismatch
warning · Consistency
Flags Raft groups where the observed peer count exceeds the expected replica count from stream or consumer configuration.
Remediation. Raft group reports more peers than the configured num_replicas. This typically occurs after a peer-remove followed by peer-add where the old peer was not fully removed, or after a replica count decrease that did not fully propagate. Use nats stream cluster peer-remove to remove the extra peer, or update num_replicas to match the desired count.