Synadia Insights

Audit Checks Reference

This document is the canonical reference for all automated audit checks. Each check runs against collected monitoring data to surface operational issues and optimization opportunities.

See Audit Checks for the conceptual overview, severity levels, and the operational vs optimization distinction.

The health dot in the UI reflects only operational findings. Any critical finding turns it red (darker red at five or more critical findings); otherwise the color scales with finding count from green through yellow and orange.


Operational Checks

Server

SERVER_001 — Connection Readiness Failure

critical · Health

Flags servers reporting connection readiness failures via the healthz endpoint.

Remediation. Server not ready for connections. Check listener port conflicts, TLS certificate errors, and permission issues. Restart only after identifying the root cause.

SERVER_002 — Server Version Mismatch

warning · Consistency

Identifies servers running a different software version than the cluster majority.

Remediation. Complete the rolling upgrade so all servers run the same version. If intentional, verify compatibility between versions.

SERVER_003 — High CPU Usage

warning · Performance

Flags servers where per-core CPU usage meets or exceeds the threshold.

Remediation. Identify the workload driving CPU usage using the /pprof/profile debug endpoint. Common causes: high-fanout subjects, subscription matching overhead, JetStream write pressure. Scale out with additional servers, optimize subject hierarchies, or increase server capacity.

SERVER_004 — Slow Consumers

critical · Performance

Flags servers with new slow consumer events since the previous epoch.

Remediation. Slow consumer eviction triggers when the server's outbound write to a client times out due to buffer saturation. Identify the affected clients and increase their message processing throughput. Increase max_pending in the server config (default 64 MiB) to buffer more, add consumer instances, reduce message rates, or have clients subscribe to fewer subjects.
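
The relevant knobs live in the server configuration; a minimal sketch with illustrative values (defaults may vary by server version):

```conf
# Per-connection outbound buffer; eviction triggers when a client
# falls this far behind (server default is 64 MiB).
max_pending: 128MB

# How long a blocked write to a client may stall before the
# connection is treated as a slow consumer.
write_deadline: "10s"
```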

SERVER_005 — JetStream Memory Pressure

warning · Saturation

Flags servers where JetStream memory usage is at or above 90% of reserved.

Remediation. Reduce memory-backed stream usage, increase max_mem reservation, or convert large memory-backed streams to file-backed storage.

SERVER_006 — JetStream Domain Whitespace

warning · Consistency

Flags servers whose JetStream domain name contains whitespace characters.

Remediation. Remove whitespace from the JetStream domain in the server configuration and reload.

SERVER_007 — Authentication Not Required

critical · Health

Flags servers that do not require client authentication.

Remediation. Enable authentication in the server configuration. Options: NKey authentication, JWT-based operator mode with signed credentials, user/password, TLS certificate mapping, or auth callout for centralized policy enforcement. Avoid no_auth_user in production.
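
A minimal sketch of NKey-based authentication in the server config (the public key shown is a placeholder, not a real NKey):

```conf
# Require clients to authenticate with an NKey; the server stores
# only the public key. "UA..." is a hypothetical placeholder.
authorization {
  users = [
    { nkey: "UA..." }
  ]
}
```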

SERVER_008 — Unexpected Server Restart

critical · Change

Detects servers that restarted without an accompanying version upgrade. Compares start times across consecutive epochs and excludes restarts where the server version changed (planned upgrade).

Remediation. This restart was not preceded by a version upgrade. Check server logs for the shutdown reason. Common causes: OOM kill, hardware failure, process crash, or external signal. Review system logs (journalctl, dmesg) and NATS server logs for the exit reason.

SERVER_010 — High Route RTT

warning · Performance

Flags route connections with round-trip time exceeding the threshold.

Remediation. Investigate network latency between cluster peers. Routes are intra-cluster. Check for network congestion, firewall inspection overhead, or co-locate servers in the same region.

SERVER_011 — Connection Count High

warning · Saturation

Flags servers where active connections approach the configured maximum.

Remediation. Increase max_connections in the server config or per-account limits. Distribute clients across more servers using DNS round-robin or a load balancer. Identify and close unnecessary connections.

SERVER_012 — Stale Connections

warning · Errors

Flags servers with new stale connection events since the previous epoch.

Remediation. PING keepalive timeout detected. The server sent PING but the client did not respond with PONG within the configured interval (default: 2 minutes with 3 pings). Common causes: network partition or firewall dropping idle connections, client process crashed without closing the TCP connection, or client event loop blocked and unable to process PINGs. Check firewall idle-timeout settings and client health.

SERVER_013 — Stalled Clients

warning · Performance

Flags servers with new stalled client events since the previous epoch.

Remediation. Stalled clients indicate fast producers blocked by downstream backpressure. The server is throttling the producer because a consumer cannot keep up. Improve consumer processing speed, add consumer instances, or increase max_pending to buffer more. This is normal flow control and an early indicator before slow consumer eviction.

SERVER_014 — JetStream Subsystem Unhealthy

critical · Health

Flags servers with JETSTREAM-type healthz errors.

Remediation. JetStream subsystem unhealthy. No contact with meta leader, not current, or still recovering. Check cluster connectivity and meta group status via nats server report jetstream.

SERVER_015 — Stream Recovery Failure

critical · Consistency

Flags servers with STREAM or CONSUMER-type healthz errors.

Remediation. JetStream stream or consumer could not be recovered from disk. Check for corrupt store files. Use nats stream report to identify affected streams.

SERVER_016 — Account Resolution Failure

warning · Consistency

Flags servers with ACCOUNT-type healthz errors.

Remediation. JetStream account could not be resolved. Verify account JWT configuration and resolver connectivity.

SERVER_017 — JetStream Storage Pressure

warning · Saturation

Flags servers where JetStream storage usage is at or above 90% of reserved.

Remediation. Add retention limits (max_age, max_bytes) to streams consuming the most space, purge stale data, increase max_store reservation, or add disk capacity.
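
As a sketch, retention limits can be added with the nats CLI against a running deployment (stream name and values are hypothetical; verify flags with nats stream edit --help):

```shell
# Cap the hypothetical ORDERS stream at 7 days and 50 GB of data
nats stream edit ORDERS --max-age=7d --max-bytes=50GB --force
```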

SERVER_018 — High Gateway RTT

warning · Performance

Flags gateway connections with round-trip time exceeding the threshold.

Remediation. Investigate network latency between clusters. Gateways are inter-cluster. High RTT may be expected for cross-region deployments. Consider stream placement closer to consumers to reduce gateway dependency.

SERVER_019 — JetStream Storage vs Configured Limit

warning · Saturation

Flags servers where JetStream storage usage approaches the configured max_store limit; exceeding it causes Raft failures.

Remediation. Actual JetStream storage usage is approaching the configured max_store limit. Unlike js_reserved_storage (sum of stream reservations), max_store is the hard filesystem limit. When usage exceeds max_store, Raft WAL writes fail and streams become unavailable. Increase max_store in the server config, add disk capacity, or reduce stream storage by purging data or adding retention limits.

Cluster

CLUSTER_001 — Memory Usage Outlier

warning · Saturation

Flags servers whose memory usage exceeds the configured multiplier of their cluster average.

Remediation. Investigate what is consuming memory on the outlier server. Common causes include large Raft state, many subscriptions, or memory-backed streams. Consider rebalancing workload.

CLUSTER_003 — High HA Assets

warning · Saturation

Flags servers with 1000 or more highly available JetStream assets.

Remediation. Distribute streams and consumers across more servers. Consider reducing replica counts on low-priority streams or consolidating small streams.

CLUSTER_004 — Cluster Name Whitespace

warning · Consistency

Flags servers whose cluster name contains whitespace characters.

Remediation. Remove whitespace from the cluster name in the server configuration and restart the affected servers.

CLUSTER_005 — Route Count Low

warning · Health

Flags servers with fewer cluster routes than expected based on cluster size.

Remediation. Verify network connectivity between cluster peers on the cluster port. Ensure all servers have matching cluster names and correct route URLs (full mesh requires each server to list at least one other). Check firewall rules and DNS resolution for route hostnames. Expected route count is N-1 for a full-mesh cluster of N servers.
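
A full-mesh route declaration looks roughly like this (cluster name and hosts hypothetical); every member must use the same cluster name and list at least one reachable seed:

```conf
cluster {
  name: "east"        # must match on every member
  port: 6222          # cluster (route) listen port
  routes = [
    "nats://server-a.internal:6222"
    "nats://server-b.internal:6222"
  ]
}
```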

CLUSTER_006 — Connection Count Change

warning · Errors

Flags servers where the connection count changed dramatically between epochs, indicating a significant increase or decrease in connected clients.

Remediation. A dramatic connection count change was detected. Investigate in order: misconfigured client reconnect backoff (default is randomized jitter, not zero), authentication failures causing immediate disconnect after connect, slow consumer evictions triggering reconnect loops, load balancer health checks creating ephemeral connections, or network instability. Review client connection URLs and reconnect settings.

CLUSTER_007 — Gateway Disconnection

critical · Health

Flags servers that lost a gateway connection since the previous epoch.

Remediation. Gateway connection to a remote cluster was lost. Gateways auto-reconnect with randomized jitter, so transient network issues resolve automatically. If the disconnection persists: check TLS certificate validity (including OCSP stapling, if enabled; stale OCSP responses are a common cause), verify firewall rules between clusters allow the gateway port, and confirm gateway names are consistent across all clusters.

CLUSTER_008 — Gateway Config Mismatch

warning · Consistency

Flags servers whose set of gateway connections differs from the cluster majority.

Remediation. Ensure all servers in the cluster have identical gateway configuration. Asymmetric gateways cause routing failures. If reject_unknown_cluster is enabled, unlisted gateways are rejected. When disabled (default), gateways can be discovered implicitly via gossip. Verify all intended gateways are listed and TLS configuration matches on both sides.

Account

ACCOUNTS_001 — Account Connection Limit

warning · Saturation

Flags accounts where connections are at or above 90% of the configured limit.

Remediation. Increase the account connection limit in the JWT or server config. Distribute clients across more servers or use separate credentials per service instance.

ACCOUNTS_002 — Slow Consumers

critical · Errors

Flags accounts with new slow consumer events since the previous epoch, aggregated across servers.

Remediation. Identify the affected consumers within this account and increase their processing throughput. Consider adding more consumer instances or reducing message rates.

ACCOUNTS_003 — Inactive JWT Import

critical · Consistency

Detects imports declared in the account JWT but not activated by the server. Diagnoses root cause: missing activation token, expired token, token signed by rotated signing key, or source export not found.

Remediation. JWT import is declared but not activated by the server. Follow the action indicated by the diagnostic detail column. 'Missing activation token': request an activation token from the exporting account operator. 'Expired activation token': renew the token with the exporting account. 'Activation token signed by rotated signing key': re-issue the token with the current signing key. 'Source export not found': verify the export exists in the exporting account JWT.

ACCOUNTS_004 — Orphaned Export

warning · Consistency

Flags exports with no matching importer in any account. Uses NATS wildcard subject matching.

Remediation. Remove the unused export from the account JWT, or create the intended import in the consuming account.

ACCOUNTS_005 — No Subscription Interest

info · Consistency

Finds active imports where no client in the importing account subscribes to the imported subject. Uses NATS wildcard subject matching.

Remediation. Verify clients are subscribing to the correct subject. Remove the import if it is no longer needed.

ACCOUNTS_006 — Account Subscription Limit

warning · Saturation

Flags accounts where subscriptions are at or above 90% of the configured limit.

Remediation. Increase the account subscription limit in the JWT or server config. Consolidate clients that subscribe to overlapping subjects or remove unused subscriptions.

JetStream

JetStream checks apply to streams, KV stores, and object stores. Each stream-scoped check result carries an entity_type value (stream, kvstore, or objectstore) indicating which kind of asset the finding relates to.

JETSTREAM_001 — Stream Replica Lag

warning · Consistency

Flags stream replicas whose last sequence number is more than 10% behind the leader.

Remediation. Check for resource contention (CPU, disk I/O, network bandwidth) on the lagging replica's server. Replicas catch up automatically through Raft replication. If auto-catchup stalls, use nats stream cluster peer-remove followed by re-adding to force a full snapshot sync, but only as a last resort.

JETSTREAM_002 — High Subject Cardinality

warning · Saturation

Flags streams with one million or more unique subjects.

Remediation. Review the subject naming scheme. Consider partitioning high-cardinality data across multiple streams or using a flatter subject hierarchy.

JETSTREAM_003 — Stream Message Limit

warning · Saturation

Flags streams where message count is at or above 90% of the limit.

Remediation. Add a retention policy (max_age) to expire old messages, increase max_msgs, or set max_msgs_per_subject to distribute limits across subjects.

JETSTREAM_004 — JS API Request Rate High

warning · Performance

Flags when the JetStream API request rate exceeds the threshold.

Remediation. Identify the source of API requests. Common causes: rapid stream/consumer creation, excessive info lookups, or tight-loop API calls. Reduce concurrent API call volume. The server queues requests internally and publishes an advisory when the queue saturates. Cache JetStream info responses client-side where possible.

JETSTREAM_005 — JS API Pending High

warning · Performance

Flags servers where JetStream API inflight requests exceed the threshold.

Remediation. Reduce the rate of concurrent JetStream API calls. Check for clients making synchronous API calls in tight loops.

JETSTREAM_006 — Consumer Count Change

warning · Errors

Flags when the total consumer count change between epochs exceeds the threshold, indicating a significant increase or decrease.

Remediation. Identify what is creating and destroying consumers rapidly. Use durable consumers to avoid recreation. For ephemeral consumers, set inactive_threshold (default 5s) to a longer duration if consumers are being deleted prematurely due to brief inactivity gaps.
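
As a sketch, a durable pull consumer with a longer inactivity threshold can be created with the nats CLI (stream and consumer names hypothetical; check nats consumer add --help for the flags your version supports):

```shell
# Durable pull consumer on the hypothetical ORDERS stream; durables
# survive client restarts instead of being recreated each time.
nats consumer add ORDERS worker --pull --ack explicit \
  --inactive-threshold 10m --defaults
```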

JETSTREAM_007 — JetStream Memory Utilization Critical

critical · Saturation

Flags servers where JetStream memory usage exceeds the critical threshold.

Remediation. Immediately reduce memory-backed stream usage. Convert streams to file-backed storage or reduce max_mem. When memory is exhausted, new stream writes are rejected.

JETSTREAM_008 — Stream Quorum Lost

critical · Health

Flags replicated streams where enough replicas are offline to lose quorum.

Remediation. Restore offline replicas by bringing their servers back online. If servers are permanently lost, remove failed peers via nats stream cluster peer-remove to lower the quorum requirement so the remaining peers can elect a leader.
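
Assuming a server is permanently lost, the dead peer can be dropped from the stream's Raft group with the nats CLI (stream and server names hypothetical):

```shell
# Remove the dead peer so the surviving replicas can regain quorum
nats stream cluster peer-remove ORDERS nats-east-3
```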

JETSTREAM_009 — JS API Error Rate High

warning · Errors

Flags servers where JetStream API errors exceed a percentage of total requests.

Remediation. JetStream API error rate exceeded the threshold on this server. Check server logs for specific API error categories. Common types include permission denials (403), stream/consumer not found (404), and resource exhaustion (503). High error rates often correlate with client misconfiguration (wrong stream names, insufficient permissions) rather than server issues.

JETSTREAM_010 — Stream Byte Limit

warning · Saturation

Flags streams where byte usage is at or above 90% of the limit.

Remediation. Purge stale data, increase max_bytes, enable S2 compression to reduce on-disk size, or add max_age to expire old messages.

JETSTREAM_011 — Stream Consumer Limit

warning · Saturation

Flags streams where consumer count is at or above 90% of the limit.

Remediation. Remove unused or inactive consumers, increase max_consumers, or consolidate consumers that read overlapping subject filters.

JETSTREAM_012 — JetStream Storage Utilization Critical

critical · Saturation

Flags servers where JetStream storage usage exceeds the critical threshold.

Remediation. Immediately free storage by purging or deleting low-priority streams. When storage is exhausted, stream writes fail with I/O errors. Increase max_store or add disk capacity. Set max_bytes on all streams.

JETSTREAM_013 — Stream Subject/Message Count Inconsistency

warning · Consistency

Flags streams where the number of unique subjects exceeds the total message count, an invariant violation indicating filestore corruption.

Remediation. Stream reports more unique subjects than total messages. This is an invariant violation indicating filestore accounting corruption. This condition persists across server restarts. Contact support with the stream details for guidance on recovery. Backing up and recreating the stream may be necessary.

JETSTREAM_014 — Stream Replica Message Count Divergence

critical · Consistency

Flags replicated streams where all replicas report current but have significantly different message counts, indicating filestore corruption or Raft state reset.

Remediation. Stream replicas report divergent message counts despite all being current in Raft. This indicates filestore corruption, Raft state reset, or interest-based retention sync failure. Compare replica states with nats stream info --all. The replica with the lowest count likely lost data. For interest-based retention streams, consumer ack propagation may have failed. Check consumer states across replicas.

JETSTREAM_015 — Mirror Last Seen Staleness

warning · Consistency

Flags mirror streams where the mirror consumer has stalled: zero lag but no activity while the source stream continues receiving messages.

Remediation. Mirror stream shows zero lag but hasn't received activity in over 5 minutes while the source stream continues receiving messages. The internal mirror consumer has likely stalled. Perform a leader step-down on the mirror stream to force recreation of the mirror consumer: nats stream cluster step-down STREAM_NAME.

JETSTREAM_016 — JetStream Storage vs Configured Limit Critical

critical · Saturation

Flags servers where JetStream storage usage critically exceeds the configured max_store limit, risking imminent Raft failures.

Remediation. JetStream storage usage has reached critical levels relative to the configured max_store limit. Raft WAL writes will fail when storage exceeds this limit, causing streams to lose quorum and become unavailable. Immediately free storage by purging low-priority streams, increase max_store, or add disk capacity.

JETSTREAM_017 — Mirror Lag Critical

critical · Consistency

Flags mirror streams where mirror lag exceeds the operator-defined io.nats.monitor.lag-critical threshold.

Remediation. The mirror stream is falling behind the source by more than the operator threshold. Check network connectivity to the mirror source and resource contention on the mirror server.

JETSTREAM_018 — Mirror Seen Critical

critical · Consistency

Flags mirror streams where the time since the mirror was last active exceeds the operator-defined io.nats.monitor.seen-critical threshold.

Remediation. The mirror stream has not received data from the source within the operator-defined window. Verify the source stream is active and network connectivity is healthy.

JETSTREAM_019 — Min Sources

critical · Health

Flags streams where the source count is below the operator-defined io.nats.monitor.min-sources threshold.

Remediation. The stream has fewer sources than the operator-defined minimum. Verify that all expected source streams exist and are configured correctly.

JETSTREAM_020 — Max Sources

critical · Health

Flags streams where the source count exceeds the operator-defined io.nats.monitor.max-sources threshold.

Remediation. The stream has more sources than the operator-defined maximum. Remove unexpected sources or update the threshold.

JETSTREAM_021 — Peer Expect

critical · Health

Flags streams where the actual peer count does not match the operator-defined io.nats.monitor.peer-expect threshold.

Remediation. The stream's actual peer count does not match the operator expectation. Check for offline replicas or verify the num_replicas configuration.

JETSTREAM_022 — Peer Lag Critical

critical · Consistency

Flags stream replicas where lag exceeds the operator-defined io.nats.monitor.peer-lag-critical threshold.

Remediation. A stream replica is lagging behind the leader by more than the operator threshold. Check for resource contention on the replica server.

JETSTREAM_023 — Peer Seen Critical

critical · Consistency

Flags stream replicas where the time since the replica was last active exceeds the operator-defined io.nats.monitor.peer-seen-critical threshold.

Remediation. A stream replica has not been active within the operator-defined window. The replica may be offline or experiencing network issues.

JETSTREAM_024 — Message Count Threshold

warning/critical · Saturation

Flags streams where message count exceeds operator-defined thresholds. Direction is inferred from threshold ordering.

Remediation. The stream message count has exceeded the operator-defined threshold. If too many: add retention policies or increase limits. If too few: investigate upstream publishers.

JETSTREAM_025 — Subject Count Threshold

warning/critical · Saturation

Flags streams where subject count exceeds operator-defined thresholds. Direction is inferred from threshold ordering.

Remediation. The stream subject count has exceeded the operator-defined threshold. Review subject naming scheme and consider partitioning high-cardinality data.

Meta Cluster

META_001 — Offline Replica

critical · Health

Flags meta cluster replicas that are reported as offline.

Remediation. Bring the offline server back online. Check server logs and network connectivity. If the server is permanently lost, remove it via nats server cluster peer-remove. This requires the remaining members to have quorum.

META_002 — Leader Disagreement

critical · Health

Flags when multiple servers report themselves as the meta cluster leader.

Remediation. This indicates a transient state during a leader election, not a split-brain (Raft quorum rules prevent one). Multiple servers may briefly report leadership during term transitions. If persistent, check network connectivity between cluster peers and ensure no asymmetric partition exists. Restart the lower-term leader to force convergence.

META_003 — Meta Leader Flapping

warning · Errors

Flags when the meta cluster leader has changed more than the allowed number of times in the recent time window.

Remediation. Meta cluster leader changed more than the allowed number of times in the time window. Raft heartbeats are sent every 1 second and the election timeout is 4-9 seconds. Any disruption longer than 4 seconds triggers a new election. Investigate: network instability between meta cluster peers (even brief packet loss can trigger elections), CPU saturation delaying heartbeat processing, or disk I/O stalls blocking Raft WAL writes. Use nats server report jetstream to see current meta state.

META_004 — Meta Snapshot Slow

warning/critical · Performance

Flags when the meta cluster snapshot duration exceeds the warning or critical threshold.

Remediation. Reduce the number of JetStream assets (streams, consumers) to shrink snapshot size. Check disk I/O performance on meta cluster servers.

META_005 — Meta State Growth

warning · Saturation

Flags when the total number of JetStream asset replicas exceeds the threshold.

Remediation. Remove unused streams and consumers. Consider reducing replica counts or consolidating small streams to reduce the total Raft group count.

META_006 — Meta Quorum Lost

critical · Health

Flags when enough meta cluster peers are offline to lose quorum.

Remediation. Immediately restore offline meta cluster servers. Without quorum, all JetStream API operations are stalled cluster-wide. If servers are permanently lost, stop all remaining meta servers, remove the failed peer's Raft WAL state, and restart to re-bootstrap the meta group. Quorum loss detection occurs within 10 seconds.

META_007 — Even Cluster Size

warning · Consistency

Flags when the meta cluster has an even number of peers.

Remediation. Consider adding or removing a server to make the meta cluster an odd size. With quorum = floor(N/2) + 1, an even-sized cluster tolerates no more failures than the next-smaller odd size (e.g., 3 nodes tolerate 1 failure; 4 nodes also tolerate only 1), so the extra peer adds overhead without improving fault tolerance.
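
The quorum arithmetic is easy to check directly; this sketch prints quorum size and tolerated failures for a few cluster sizes:

```shell
# quorum = floor(N/2) + 1; tolerated failures = N - quorum
for n in 3 4 5 6; do
  quorum=$(( n / 2 + 1 ))
  echo "N=$n quorum=$quorum tolerates=$(( n - quorum ))"
done
# N=3 and N=4 both tolerate 1 failure; N=5 and N=6 both tolerate 2.
```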

META_008 — Meta Pending High

warning · Performance

Flags when the meta cluster leader has a high number of pending Raft operations.

Remediation. High pending operations indicate the meta group is falling behind on consensus. Check server CPU, disk I/O, and network latency on the meta leader. Consider reducing JetStream API request rate.

META_009 — Meta Cluster Size Decreased

critical · Health

Flags when the meta cluster size has decreased between consecutive epochs, indicating a peer was removed or lost.

Remediation. Meta cluster size decreased. A peer was removed or lost. If intentional (planned decommission via nats server cluster peer-remove), verify the remaining cluster has an odd number of peers for optimal quorum. If unintentional, immediately investigate the lost peer. Check server logs, network connectivity, and disk health. A shrinking meta cluster reduces fault tolerance.

Service

SERVICE_001 — Service Version Mismatch

warning · Consistency

Flags services where instances report different client versions or languages.

Remediation. Complete the rolling deployment so all instances run the same version. If intentional, verify compatibility between versions.

SERVICE_002 — Service Down

critical · Health

Flags services that had instances in the previous epoch but zero in the current epoch.

Remediation. Restart the service instances. Check application logs and orchestration platform (Kubernetes, systemd) for failure reasons.

Leafnode

LEAF_001 — Leafnode Name Whitespace

warning · Consistency

Flags leafnode connections whose remote server name contains whitespace.

Remediation. Remove whitespace from the server name in the leafnode's configuration and restart it.

LEAF_002 — High Leaf RTT

warning · Performance

Flags leafnode connections with round-trip time exceeding the threshold.

Remediation. Investigate network latency between the leaf server and the hub. Consider co-locating leaf servers closer to hubs or checking for network congestion.

LEAF_003 — Leafnode Subscription Count High

warning/critical · Saturation

Flags leafnode connections carrying a large number of subscriptions, which can cause hub processing to exceed the stale connection timeout.

Remediation. Leafnode carries a high subscription count which increases hub processing time during connection establishment. If processing exceeds the 2-second stale connection timeout, the connection will be dropped and retried in a loop. Reduce the number of subscriptions propagated across the leafnode. Use explicit exports/imports instead of wildcard subscriptions, or consolidate subscriber applications.

Connection

CONN_001 — High Client RTT

warning · Performance

Flags client connections with round-trip time exceeding 100 ms.

Remediation. Move the client closer to the NATS server or use a leafnode to bridge the distance. Check for network congestion.

CONN_002 — Client Pending Pressure

warning · Performance

Flags client connections with more than 1 MiB of pending bytes.

Remediation. Increase the client's message processing throughput or reduce the publish rate to this subscriber. The default pending buffer is 64 MiB (max_pending in server config). This is an early warning before slow consumer disconnection.

CONN_003 — Connection Stopped

info · Errors

Flags connections that disconnected with a non-empty reason.

Remediation. Connection disconnected with a non-empty stop reason. Review the stop reason in the detail column for the specific cause. Common reasons: 'Slow Consumer - Loss' (client could not keep up with message rate), 'Authentication Failure' (invalid or expired credentials), 'Server Shutdown' (planned maintenance), 'Maximum Connections Exceeded' (server or account limit reached).

Consumer

CONSUMER_001 — Consumer Replica Offline

critical · Health

Flags consumer replicas that are reported as offline.

Remediation. Bring the offline server back online. Check server logs for the reason the replica went offline.

CONSUMER_002 — Consumer Replica Lag

warning · Consistency

Flags consumer replicas lagging by more than 1000 operations behind the leader.

Remediation. Check for resource contention on the lagging server. If persistent, consider removing and re-adding the consumer replica.

CONSUMER_003 — Consumer Quorum Lost

critical · Health

Flags replicated consumers where enough replicas are offline to lose quorum.

Remediation. Restore offline replicas by bringing their servers back online. Without quorum the consumer cannot make progress.

CONSUMER_004 — Consumer Delivered Below Stream First Sequence

critical · Consistency

Flags consumers whose last delivered position is below the stream's first sequence after a purge or truncation.

Remediation. The consumer's delivered position references a sequence number that no longer exists in the stream, likely after a stream purge or truncation. The consumer appears healthy but silently misses all new messages. Delete and recreate the consumer, or use nats consumer edit to reset its deliver policy to 'all' or 'last'.
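
To confirm the condition before recreating anything, compare the consumer's delivered stream sequence with the stream's first sequence (names hypothetical; requires a running deployment):

```shell
# Delivered position of the hypothetical 'worker' consumer
nats consumer info ORDERS worker
# First/last sequence of the stream after the purge or truncation
nats stream info ORDERS
```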

CONSUMER_005 — Consumer Sequence Ahead of Stream Sequence

critical · Consistency

Flags consumers whose delivered position is ahead of the stream's last sequence.

Remediation. The consumer's delivered position is ahead of the stream's last sequence: the consumer is waiting for messages the stream hasn't produced yet. This can happen after stream movement across clusters, leadership transfers with data loss, or Raft state resets. Delete and recreate the consumer to reset its position.

CONSUMER_006 — Outstanding Ack Critical

critical · Health

Flags consumers where num_ack_pending exceeds the operator-defined threshold.

Remediation. The consumer has more outstanding acks than the operator threshold. Increase consumer throughput, scale consumers, or raise the threshold in stream/consumer metadata.

CONSUMER_007 — Waiting Critical

critical · Health

Flags consumers where num_waiting exceeds the operator-defined threshold.

Remediation. The consumer has more waiting pull requests than the operator threshold. Add consumer instances or increase max_waiting to accommodate the load.

CONSUMER_008 — Unprocessed Critical

critical · Health

Flags consumers where num_pending exceeds the operator-defined threshold.

Remediation. The consumer has more unprocessed messages than the operator threshold. Scale consumer processing capacity or investigate consumer stalls.

CONSUMER_009 — Last Delivery Critical

critical · Health

Flags consumers where the time since the last delivery exceeds the operator-defined threshold.

Remediation. The consumer has not delivered a message within the operator-defined window. Check if the consumer is stalled, paused, or if the stream has stopped receiving messages.

CONSUMER_010 — Last Ack Critical

critical · Health

Flags consumers where the time since the last acknowledgment exceeds the operator-defined threshold.

Remediation. The consumer has not acknowledged a message within the operator-defined window. Check if downstream processing is stalled or if the consumer application is healthy.

CONSUMER_011 — Redelivery Critical

critical · Errors

Flags consumers where num_redelivered exceeds the operator-defined threshold.

Remediation. The consumer is redelivering more messages than the operator threshold. Investigate processing failures, increase ack_wait, or fix downstream errors causing nacks.

CONSUMER_012 — Pinned Consumer Policy Mismatch

critical · Consistency

Flags consumers with io.nats.monitor.pinned metadata that are not using the overflow priority policy.

Remediation. The consumer metadata indicates it should be pinned but the priority_policy is not set to overflow. Update the consumer configuration to use priority_policy=overflow.
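A minimal consumer configuration sketch illustrating the expected shape (the consumer name, group name, and metadata value are hypothetical; the priority fields assume a server version that supports priority groups):

```json
{
  "durable_name": "pinned-worker",
  "ack_policy": "explicit",
  "priority_groups": ["jobs"],
  "priority_policy": "overflow",
  "metadata": {
    "io.nats.monitor.pinned": "true"
  }
}
```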

User

USER_001 — Bearer Token User

warning · Errors

Flags bearer token users with active connections.

Remediation. Migrate the user to NKey-based authentication where possible. Bearer tokens skip nonce signature verification during CONNECT, relying on JWT validity alone. They are appropriate for WebSocket and HTTP contexts where NKey signing is impractical, but should not be used for long-lived server-to-server connections.

USER_002 — Excessive User Connections

warning · Errors

Flags users with more than 100 active connections.

Remediation. Investigate why a single user has so many connections. Consider using connection pooling or separate user credentials per service instance.

Change

CHANGE_001 — Config Reload Detected

info · Change

Detects servers whose configuration was reloaded by comparing config_load_time between consecutive epochs.

Remediation. Verify the configuration change was intentional. Review server logs to confirm the reload was successful and no errors occurred.

CHANGE_002 — JetStream Domain Changed

warning · Change

Detects servers whose JetStream domain value changed between consecutive epochs.

Remediation. Verify the domain change was intentional. JetStream domain changes can affect stream and consumer routing across clusters.

CHANGE_003 — Account Added or Removed

info · Change

Detects accounts that appeared or disappeared between consecutive epochs.

Remediation. Verify the account change was expected. For new accounts, ensure imports and exports are correctly wired. For removed accounts, confirm no dependent services remain.

CHANGE_004 — Stream Configuration Changed

info · Change

Detects streams whose configuration fields (replicas, retention, limits) changed between consecutive epochs.

Remediation. Verify the stream configuration change was intentional and monitor for downstream effects on consumers.


Optimization Checks

Placement

OPT_PLACE_001 — Cross-Cluster Stream Access

info · Performance

Flags accounts with clients in clusters that have no local stream leaders.

Remediation. Place stream replicas in clusters where clients connect, or migrate clients to clusters with existing stream leaders to reduce gateway traffic.

OPT_PLACE_002 — Consumer Leader Not Co-located

info · Performance

Flags consumers whose leader is in a different cluster than the majority of connections.

Remediation. Use preferred placement tags to co-locate consumer leaders with the majority of subscribing clients. To force a leader election, use nats consumer cluster step-down which may relocate the leader to a better-positioned replica.
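For example, to trigger a leader election against a running cluster (stream and consumer names are hypothetical):

```shell
# Ask the current consumer leader to step down; a replica in a
# better-positioned cluster may win the subsequent election.
nats consumer cluster step-down ORDERS ORDERS_WORKER
```

Election outcomes are not deterministic, so the command may need to be repeated until the leader lands where intended.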

OPT_PLACE_003 — High Gateway Traffic Ratio

info · Performance

Flags accounts where more than 30% of traffic is cross-cluster gateway traffic.

Remediation. Review stream and consumer placement for this account. Move workloads closer to data to reduce inter-cluster traffic.

OPT_PLACE_004 — Gateway Interest Mode

info · Performance

Flags gateway account combinations still using optimistic interest mode.

Remediation. Optimistic mode floods all messages to remote clusters until interest is learned. The server auto-transitions to interest-only mode after a subscription activity threshold is reached. If this account is stuck in optimistic mode, verify the gateway is running NATS 2.9+ (where interest-only is the default) or check for high subscription churn preventing the transition.

Cost

OPT_COST_001 — Over-Replicated Inactive Stream

info · Consistency

Flags R3+ streams with no new messages across the selected time range.

Remediation. Reduce the replica count to R1 for inactive streams, or delete the stream if it is no longer needed.

OPT_COST_002 — Memory Storage Large Stream

info · Saturation

Flags memory-backed streams using more than 100 MiB.

Remediation. Convert the stream to file-backed storage if low-latency access is not required. Memory-backed streams consume server RAM directly.

OPT_COST_003 — Wasted JetStream Memory Reservation

info · Consistency

Flags servers where JetStream memory usage is below 20% of reserved capacity.

Remediation. Reduce the JetStream memory reservation to match actual usage, or migrate memory-backed streams to this server to improve utilization.

OPT_COST_004 — Uncompressed Large Stream

info · Consistency

Flags file-backed streams exceeding 1 GiB with no compression enabled.

Remediation. Enable S2 compression on the stream configuration to reduce disk usage and I/O costs.
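For example (the stream name is hypothetical; the `--compression` flag assumes a nats CLI and server new enough to support stream compression):

```shell
nats stream edit ORDERS --compression s2
```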

OPT_COST_005 — Wasted JetStream Storage Reservation

info · Consistency

Flags servers where JetStream storage usage is below 20% of reserved capacity.

Remediation. Reduce the JetStream storage reservation to match actual usage, or migrate file-backed streams to this server to improve utilization.

Balance

OPT_BALANCE_001 — Uneven Leader Distribution

info · Saturation

Flags servers hosting disproportionately many stream and consumer leaders.

Remediation. Use nats stream cluster step-down and nats consumer cluster step-down to redistribute leaders across the cluster. Target servers with the highest leader counts first.

OPT_BALANCE_002 — Connection Hotspot

info · Saturation

Flags servers with more than double the cluster average connections.

Remediation. Review client connection configuration. Use DNS round-robin or a load balancer to distribute connections more evenly across cluster servers.

OPT_BALANCE_003 — Subscription Hotspot

info · Saturation

Flags servers with more than double the cluster average subscriptions.

Remediation. Redistribute client connections to balance subscription load. Check for clients with excessive subscriptions.

OPT_BALANCE_004 — Stream Replica Count Imbalance

info · Saturation

Flags servers hosting disproportionately many stream replicas.

Remediation. Use placement tags to distribute new streams more evenly. Consider removing and re-adding replicas to rebalance.

OPT_BALANCE_005 — JetStream Storage Skew

info · Saturation

Flags servers whose JetStream storage exceeds double the cluster average.

Remediation. Migrate large streams to other cluster servers or add storage capacity to balance disk usage.

OPT_BALANCE_006 — Account Connection Concentration

info · Saturation

Flags servers hosting more than 70% of an account's connections.

Remediation. Configure client connection URLs to include multiple servers. Use a load balancer to spread connections across the cluster.

OPT_BALANCE_007 — Stream-Consumer Leader Co-location

info · Saturation

Flags streams where the stream leader's server also hosts a disproportionate share of consumer leaders.

Remediation. The stream leader's server hosts more than half of the consumer leaders for this stream, concentrating I/O and CPU load on a single node. Use nats consumer cluster step-down to redistribute consumer leaders across the cluster.

OPT_BALANCE_008 — JetStream Storage Saturation with Skew

warning · Saturation

Flags servers with high JetStream storage utilization where the cluster also exhibits significant storage skew between nodes.

Remediation. Server is near JetStream storage capacity and the cluster has significant storage imbalance between nodes. Migrate streams from the saturated server to underutilized peers, or increase storage on the saturated server. Use placement tags to guide future stream placement.

Account

OPT_ACCT_001 — Account Storage Quota Approaching Limit

warning · Saturation

Flags accounts where JetStream storage reservations approach the configured quota.

Remediation. Account's JetStream storage reservations are approaching the configured quota. When the quota is reached, all new stream creates and stream writes for this account will fail. Reduce stream max_bytes reservations, delete unused streams, or increase the account's js_disk_storage limit. Note: NATS enforces quotas by reservation (max_bytes × num_replicas), not actual bytes used.
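Because quotas are charged by reservation rather than bytes on disk, a quick worked example (all values illustrative):

```shell
# Reservation charged against the account quota = max_bytes * replicas.
max_bytes=$(( 10 * 1024 * 1024 * 1024 ))   # stream max_bytes: 10 GiB
replicas=3                                  # R3 stream
reserved=$(( max_bytes * replicas ))
echo "reserved against quota: $(( reserved / 1024 / 1024 / 1024 )) GiB"   # 30 GiB
```

A 10 GiB R3 stream consumes 30 GiB of quota even if it currently stores only a few megabytes.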

OPT_ACCT_002 — Excessive JWT Size

warning · Consistency

Flags accounts with unusually large JWT claims, indicating excessive permissions or revocations.

Remediation. Account JWT is unusually large, likely due to excessive permissions, many revocations, or a large number of signing keys. Large JWTs increase memory usage and slow account resolution on every connection. Review the account's permissions and revocations. Consolidate wildcard permissions where possible, and prune expired revocations.

Idle Resources

OPT_IDLE_001 — Underutilized Server

info · Health

Flags servers that remained nearly idle across the selected time range.

Remediation. Consider decommissioning the server or migrating workload to it from busier servers.

OPT_IDLE_002 — Inactive Stream

info · Health

Flags unsealed streams that received no new messages across the time range.

Remediation. Delete the stream if it is no longer needed, or seal it to prevent accidental writes. If temporarily inactive, no action is needed.

OPT_IDLE_003 — Inactive Consumer

info · Consistency

Flags consumers that made no delivery progress across the time range.

Remediation. Delete the consumer if it is no longer processing messages. Check whether the subscribing application is running.

OPT_IDLE_004 — Drained Consumer

info · Consistency

Flags consumers fully caught up with zero pending on an inactive stream.

Remediation. Consider deleting the consumer since its stream has no new messages and all existing messages have been processed.

OPT_IDLE_005 — Inactive Account

info · Health

Flags non-system accounts with no connections or throughput for the configured inactivity threshold (default 24h).

Remediation. Review whether the account is still needed. Remove or disable it if no longer in use.

OPT_IDLE_006 — Disconnected Users

info · Consistency

Flags non-system account users with no active connections at the current epoch.

Remediation. Verify whether the user credential is still in use. Revoke the user if no longer needed.

OPT_IDLE_007 — Idle Client Connections

info · Consistency

Flags client connections idle for more than 5 minutes with zero messages.

Remediation. Client connection has been idle with zero messages for longer than the threshold. Diagnostic steps: check the subscription count. If zero, the connection is likely leaked (connected but never subscribed). If subscriptions exist, check the client library name and version. It may be a monitoring or health-check client that connects but does not publish or subscribe to active subjects. Close leaked connections to free server resources.

System Improvement

OPT_SYS_001 — Streams Without Limits

info · Consistency

Flags streams with no message, byte, or age retention limits.

Remediation. Configure at least one retention limit (max_msgs, max_bytes, max_age, or max_msgs_per_subject) to prevent unbounded disk growth.
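For example, to cap an existing stream by age and size (stream name and values are hypothetical):

```shell
nats stream edit EVENTS --max-age 72h --max-bytes 50GB
```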

OPT_SYS_002 — High Consumer Redelivery

warning · Errors

Flags consumers with a redelivery rate exceeding 10%.

Remediation. Redelivery rate exceeded the threshold. Messages are being delivered multiple times to this consumer. Common causes: processing time exceeds ack_wait (default 30s), application panics before acknowledging, or incorrect ack logic (acking the wrong message). Set max_deliver to cap retry attempts and prevent infinite redelivery loops. Configure backoff for exponential retry spacing. Increase ack_wait if processing legitimately takes longer.
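A sketch of consumer settings addressing these causes (names and values are hypothetical; exact flag support varies by nats CLI version, and some fields may only be settable at creation time with nats consumer add):

```shell
nats consumer edit ORDERS WORKER \
  --ack-wait 60s \
  --max-deliver 5 \
  --backoff linear --backoff-min 5s --backoff-max 2m
```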

OPT_SYS_003 — Ack Pending Buildup

warning · Errors

Flags consumers approaching their maximum ack pending limit.

Remediation. Scale out consumer instances to process messages faster, increase max_ack_pending (default 1,000), or investigate why messages are not being acknowledged. Ack pending can also be limited at the stream level via consumer limits and at the account level.

OPT_SYS_004 — Unbound Push Consumer

warning · Errors

Flags push consumers with no subscriber currently bound.

Remediation. Start the subscribing application or convert to a pull consumer. While unbound, messages are delivered to the deliver subject with no receiver. They accumulate in ack pending and trigger redeliveries until max_deliver is reached.

OPT_SYS_005 — Route Pending Pressure

warning · Performance

Flags route connections with more than 1 MiB of pending data.

Remediation. Investigate network bandwidth between cluster peers. Reduce message rates on high-volume intra-cluster subjects or upgrade network capacity between peers.

OPT_SYS_006 — Leaf Compression Disabled

info · Consistency

Flags leaf connections with compression disabled.

Remediation. Enable S2 compression in the leafnode configuration (compression: s2_auto for adaptive compression based on RTT). Available modes: s2_fast, s2_better, s2_auto (default when enabled). Configure on both hub and leaf sides.
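A minimal configuration sketch for both sides (port and URL are illustrative):

```
# Hub side
leafnodes {
  port: 7422
  compression: s2_auto
}

# Leaf side
leafnodes {
  remotes: [
    { url: "nats-leaf://hub.example.com:7422", compression: s2_auto }
  ]
}
```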

OPT_SYS_007 — Raft Apply Lag

warning · Performance

Flags Raft groups where committed-applied gap exceeds 100 entries.

Remediation. Check disk I/O and CPU on the affected server. The apply lag indicates the server is falling behind in processing committed Raft entries.

OPT_SYS_008 — Unlimited JetStream Account

info · Consistency

Flags non-system accounts with JetStream enabled but no storage limits.

Remediation. Set JetStream memory and disk storage limits in the account JWT to prevent a single account from exhausting cluster resources.
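In an operator-mode deployment this is done with nsc (account name and values are hypothetical; flag names assume a current nsc release):

```shell
nsc edit account APP --js-mem-storage 1G --js-disk-storage 50G
```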

OPT_SYS_009 — Leaderless Raft Group

critical · Health

Raft group has no elected leader and cannot process writes.

Remediation. Investigate cluster connectivity. A leaderless group is detected within 10 seconds of quorum loss. Ensure a quorum of peers is online and reachable. Check server logs for election failures or network partition indicators.

OPT_SYS_010 — Raft IPQ Backpressure

warning · Performance

Internal queue lengths for a raft group exceed threshold, indicating processing backlog.

Remediation. High IPQ lengths indicate Raft internal queues (proposals, append entries, apply, responses) are backing up. The apply queue is the most critical: a backlog there means the upper layer (JetStream) cannot consume committed entries fast enough. Check server CPU, disk I/O, and network latency.

OPT_SYS_011 — Subscription Fanout Anomaly

info · Consistency

Flags servers where max fanout is disproportionately higher than average fanout.

Remediation. Investigate subjects with high subscriber counts. A large max-to-average fanout ratio indicates one or more subjects with excessive subscribers, which can create hot spots.

OPT_SYS_012 — Subscription Churn

info · Errors

Flags servers with excessive subscription insert and remove operations since the previous epoch.

Remediation. Excessive subscription insert and remove operations detected. Two diagnostic paths: (1) If a single client is responsible, it is likely a misbehaving application that subscribes/unsubscribes in a loop. Identify it via connection name or IP and fix the client code. (2) If many clients are responsible, it is likely a reconnection storm. Clients reconnecting simultaneously re-subscribe all at once. Check for a preceding network event or server restart that triggered mass reconnection.

OPT_SYS_013 — Raft Sustained Catching Up

warning · Health

Flags Raft groups with a member in catching-up state.

Remediation. Check disk I/O, network bandwidth, and CPU on the catching-up server. If the server is persistently behind, it may need more resources or a re-sync.

OPT_SYS_014 — Gateway Pending Pressure

warning · Performance

Flags gateway connections with more than 1 MiB of pending data.

Remediation. Investigate network bandwidth between clusters. Reduce inter-cluster message rates by improving stream/consumer placement, or upgrade inter-cluster network capacity.

OPT_SYS_015 — Consumer ACK Floor Divergence

warning/critical · Errors

Flags consumers where the gap between delivered position and ACK floor is disproportionately large relative to max_ack_pending, indicating interleaved acknowledgments.

Remediation. Consumer's ACK floor is far behind its delivered position. This indicates interleaved acknowledgments where messages between the ACK floor and delivered position are tracked individually in memory. Causes include out-of-order processing, selective acking, or slow processing of specific messages. Consider using AckAll policy if ordering permits, or investigate why specific messages are not being acknowledged.

OPT_SYS_016 — Direct Gets Disabled

info · Performance

Flags streams with allow_direct disabled, forcing read operations through the Raft consensus pipeline.

Remediation. Stream has allow_direct disabled, forcing all read operations through the Raft consensus pipeline. This adds unnecessary latency and contention with writes. Enable allow_direct unless strong read-after-write consistency is required (e.g., financial transactions). Most workloads benefit from direct reads.
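For example (stream name hypothetical; the `--allow-direct` flag assumes a recent nats CLI):

```shell
nats stream edit ORDERS --allow-direct
```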

OPT_SYS_017 — Leafnode Auto Compression with High Count

info · Performance

Flags servers with many leafnode connections using s2_auto compression, which can create a CPU feedback loop under load.

Remediation. Server has a high number of leafnode connections using s2_auto compression. Under load, s2_auto can create a CPU feedback loop: compression increases CPU usage, which increases RTT, which triggers higher compression levels, further increasing CPU. Switch leafnode compression to a fixed level (s2_fast or s2_better) to prevent the feedback loop.
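A hub-side sketch pinning the compression level (port is illustrative):

```
leafnodes {
  port: 7422
  compression: s2_fast   # fixed level; avoids the RTT-driven feedback loop
}
```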

OPT_SYS_018 — High Interior Deletes on Stream

warning · Saturation

Flags streams with a very high number of interior deletes, causing disproportionate memory pressure during recovery and catch-up.

Remediation. Stream has a very high number of interior deletes. The deleted sequence bitmap is held in memory during recovery and replica catch-up, causing disproportionate memory pressure. Consider purging the stream to reset the delete map, or switching to a retention policy that avoids interior deletes.

OPT_SYS_019 — Large Deduplication Window

warning/critical · Saturation

Flags streams with a deduplication window exceeding the threshold and active message flow, risking high memory consumption from the in-memory dedup map.

Remediation. Stream has a deduplication window exceeding the threshold with active message flow. The dedup map holds an in-memory entry (~130-150 bytes) per message published within the window. With UUID-based Nats-Msg-Id headers and high message rates, this can consume gigabytes of memory. Reduce the deduplication window to the minimum required for your publisher retry interval.
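A rough sizing sketch (rate and window are illustrative; the ~140-byte per-entry cost is the midpoint of the range cited above):

```shell
rate=10000         # messages per second carrying Nats-Msg-Id
window=120         # dedup window in seconds
entry_bytes=140    # approximate in-memory cost per entry
total=$(( rate * window * entry_bytes ))
echo "approx dedup map size: $(( total / 1024 / 1024 )) MiB"   # 160 MiB
```

At this rate, halving the window halves the steady-state map size.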

OPT_SYS_020 — KV Buckets Without max_age

info · Saturation

Flags KV buckets with no max_age configured that have accumulated a large number of interior deletes (tombstones).

Remediation. KV bucket has no max_age configured and has accumulated a large number of interior deletes (tombstones from deleted keys). Set max_age to automatically expire old entries and their tombstones. Without it, the delete map grows indefinitely and causes high memory usage during node restart or replica catch-up.
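A KV bucket is backed by a stream named KV_<bucket>, so expiry can be set on new buckets at creation or on the backing stream afterward (bucket name and TTL are hypothetical):

```shell
# New bucket with automatic expiry
nats kv add CONFIG --ttl 24h

# Existing bucket: set max_age on the backing stream
nats stream edit KV_CONFIG --max-age 24h
```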

OPT_SYS_021 — R1 Streams in Multi-Node Clusters

info · Health

Flags R1 (single-replica) streams in multi-node clusters that have no redundancy.

Remediation. Stream uses R1 (single replica) in a multi-node cluster. If the hosting node goes down, the stream is completely offline until that node recovers. Consider increasing to R3 for critical data that needs high availability. R1 is appropriate for ephemeral, cacheable, or easily reproducible data.

OPT_SYS_022 — Subscription Count Growth

info · Errors

Flags servers where subscriptions are growing monotonically without a corresponding increase in connections, indicating a subscription leak.

Remediation. Server's subscription count is growing monotonically without a corresponding increase in connections, indicating a subscription leak. Identify the responsible client by examining connection subscription counts, then fix the client application to properly unsubscribe when done.

OPT_SYS_023 — Raft WAL Size Excessive

warning/critical · Saturation

Flags Raft groups with an excessively large write-ahead log, risking disk exhaustion and cascading OOM failures.

Remediation. Raft group WAL has grown excessively large. An unbounded WAL consumes disk and causes cascading failures: disk full -> memory spike (can't flush) -> OOM -> restart -> WAL replay exhausts memory again. Investigate why the WAL is not compacting. Common causes include a stalled follower preventing log truncation, or a raft group with no active consumers to advance the commit index.

OPT_SYS_024 — WorkQueue Discard New with Aggressive Consumer Settings

warning · Consistency

Flags WorkQueue streams using discard_policy=new where consumers have aggressive ack_wait or max_deliver settings, risking message loss.

Remediation. WorkQueue stream with discard_policy: new will reject publishes when the stream is full. If consumers have low max_deliver or short ack_wait, messages may be nacked and discarded before they can be processed, causing silent data loss. Increase ack_wait (recommended >= 30s) and max_deliver (recommended >= 10), or switch the discard policy to old.
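A sketch applying the recommended floors (stream and consumer names are hypothetical; some consumer fields may require delete-and-recreate rather than edit):

```shell
nats consumer edit ORDERS_WQ WORKER --ack-wait 30s --max-deliver 10
```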

OPT_SYS_025 — Sustained Consumer Growth on Stream

warning · Errors

Flags streams where consumer count has been growing steadily, indicating a consumer leak from ephemeral consumers.

Remediation. Stream's consumer count has been growing steadily. This usually indicates ephemeral consumers being created without proper cleanup. Identify the source of consumer creation, set appropriate inactive_threshold on ephemeral consumers, or convert to durable consumers with explicit deletion.

OPT_SYS_026 — Raft Group Peer Count Mismatch

warning · Consistency

Flags Raft groups where the observed peer count exceeds the expected replica count from stream or consumer configuration.

Remediation. Raft group reports more peers than the configured num_replicas. This typically occurs after a peer-remove followed by peer-add where the old peer was not fully removed, or after a replica count decrease that did not fully propagate. Use nats stream cluster peer-remove to remove the extra peer, or update num_replicas to match the desired count.
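For example, to evict the stale peer (stream and peer names are hypothetical):

```shell
# The peer name is the server name as shown in nats stream cluster info
nats stream cluster peer-remove ORDERS old-server-name
```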
