
Message Flow

This page walks through what happens when (a) something changes in a monitored service and (b) you click a button in the console. The point is to give operators a feel for the latency, the failure modes, and what shows up where.

If you're trying to understand how fresh a number is or why a command "doesn't seem to do anything," this is the page.

Telemetry: service → browser

The five things to know:

  1. Batching is 1 second. A state change inside a service waits up to one second before it leaves the process. This is the dominant latency in everything you see.
  2. Heartbeats publish even when nothing changed. A service that's idle still publishes once per second, so a service going silent in the UI means real silence — the process is gone, the broker is unreachable, or the network is partitioned.
  3. Telemetry queues up if the console is down. The transport buffers the messages. When the console comes back, it catches up; you'll see a brief burst of activity in the timeline.
  4. Persistence happens before browser update. The console writes to PostgreSQL first, then pushes to SignalR. If you reload the page, the data is already there — the live push is just to avoid the reload.
  5. Per-service ordering is preserved. Telemetry from one service is processed in order; telemetry from different services is processed concurrently. So an older state from service-A is never applied after a newer state from service-A, even when the console is busy.
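
To make points 1, 2, and 4 concrete, here is a minimal sketch in Python. It is not the console's actual code: the class names (`ServiceBatcher`, `Console`, `Batch`), the `tick` method, and the in-memory lists standing in for PostgreSQL and SignalR are all assumptions made for illustration. It shows a change waiting for the next one-second flush, an empty batch going out as a heartbeat, and the persist step running before the push step.

```python
from dataclasses import dataclass, field

BATCH_INTERVAL = 1.0  # seconds, matching the 1-second batching described above


@dataclass
class Batch:
    service: str
    sent_at: float
    changes: list  # an empty list means "heartbeat only"


@dataclass
class ServiceBatcher:
    """Hypothetical service side: collect changes, flush once per interval."""
    service: str
    last_flush: float = 0.0
    pending: list = field(default_factory=list)

    def record(self, change):
        self.pending.append(change)

    def tick(self, now):
        """Return a Batch if the interval has elapsed, otherwise None."""
        if now - self.last_flush < BATCH_INTERVAL:
            return None
        batch = Batch(self.service, now, self.pending)
        self.pending = []
        self.last_flush = now
        return batch  # returned even when there are no changes (heartbeat)


@dataclass
class Console:
    """Hypothetical console side: persist first, then push to the browser."""
    stored: list = field(default_factory=list)  # stands in for PostgreSQL
    pushed: list = field(default_factory=list)  # stands in for SignalR
    log: list = field(default_factory=list)     # records the order of steps

    def handle(self, batch):
        self.stored.append(batch)
        self.log.append(("persist", batch.service, batch.sent_at))
        self.pushed.append(batch)
        self.log.append(("push", batch.service, batch.sent_at))


svc = ServiceBatcher("service-A")
console = Console()
svc.record("queue depth 3 -> 4")

# Simulated clock: the change recorded at t=0 does not leave until t=1.0.
for now in (0.5, 1.0, 2.0):
    batch = svc.tick(now)
    if batch is not None:
        console.handle(batch)
```

In this run nothing is sent at t=0.5, the change goes out at t=1.0, and the t=2.0 batch is an empty heartbeat. Because the clock is passed in as a parameter, the timing can be followed by hand without waiting on real time.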

Commands: browser → service

The four things to know:

  1. Commands are fire-and-forget. When you click "Replay", the console acknowledges immediately. The service hasn't done the work yet — the work happens asynchronously.
  2. Audit log entries are immediate. The audit log records the click before the service confirms anything. So "Replay clicked but DLQ count didn't drop" is a real, visible state: you'll see the audit row right away, and the metric lags until the service catches up.
  3. Commands are durable. RabbitMQ persists them. If the target service is down, the command waits in the queue and runs when it comes back. There's no "the click didn't take" failure mode for transient outages.
  4. Confirmation is via telemetry, not a response. A successful replay shows up as the DLQ count dropping in the next batch (≤ 1 second). A failed replay shows up as an exception event in the timeline. There is no direct "command result" channel — the operator confirms by watching the metric.
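
The four points above can be sketched as a small simulation. This is an illustration under stated assumptions, not the console's implementation: the `Console` and `Service` classes, the `click` and `process` methods, and the `deque` standing in for a durable RabbitMQ queue are all hypothetical. It shows the click being acknowledged and audited immediately, the command waiting while the service is down, and the only confirmation being the DLQ count the service later reports.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical target service with a dead-letter queue count."""
    dlq_count: int = 5
    up: bool = True

    def process(self, queue):
        """Drain waiting commands; does nothing while the service is down."""
        while self.up and queue:
            command = queue.popleft()
            if command == "Replay":
                self.dlq_count = 0  # replayed messages leave the DLQ


@dataclass
class Console:
    """Hypothetical console: audit the click, enqueue, acknowledge."""
    audit_log: list = field(default_factory=list)
    queue: deque = field(default_factory=deque)  # stand-in for a durable queue

    def click(self, command):
        self.audit_log.append(command)  # audit row is written immediately
        self.queue.append(command)      # command is handed to the transport
        return "accepted"               # acknowledged before any work is done


console = Console()
service = Service(up=False)

ack = console.click("Replay")
service.process(console.queue)  # service is down, so the command just waits
state_while_down = (list(console.audit_log), service.dlq_count, len(console.queue))

service.up = True
service.process(console.queue)  # service is back and runs the waiting command
```

While the service is down, the audit log already holds the click but the DLQ count is unchanged, which is exactly the "audit row but no metric change" state described in point 2. Once the service returns, the count drops, and that drop is the confirmation.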

Where to look when something seems wrong

| Symptom | First thing to check |
| --- | --- |
| Service is missing from the dashboard | Is the service publishing telemetry? Check the broker for a critterwatch queue with messages flowing in. |
| Service is showing as silent (red heartbeat dot) | Process up? Broker reachable from the service? Network path between them OK? |
| Click did nothing | Open the Audit Log. Is the click recorded? If yes, the command left the console; the issue is on the service side or in the transport. If no, the console rejected it (most likely because the Operations Enabled flag is off). |
| Numbers look stale on one tab | Glance at the connection indicator in the header. Yellow or red means the SignalR connection has dropped; the page reconciles when it reconnects, or you can refresh manually. |
| "I changed a setting and it didn't apply" | Most settings (alert thresholds, retention) take effect on the next evaluator pass, typically ≤ 30 seconds. Some (transport / connection settings) need a console restart. |

Operations gating

Every state-changing command (Replay, Pause, Drain, Restart, Eject, Add/Remove tenant, etc.) checks the global Operations Enabled flag before sending. When the flag is off, the corresponding buttons in the UI render disabled with a tooltip explaining why, and the HTTP API refuses the call.

Read-only queries (the dashboard, the timeline, store inspector) are always allowed.
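
As a rough sketch of the gating rule, assuming a single boolean flag and a fixed set of state-changing command names (the `handle_request` function, the `OperationsDisabled` exception, and the `STATE_CHANGING` set are hypothetical names, not the console's API):

```python
class OperationsDisabled(Exception):
    """Raised when a state-changing command arrives while the flag is off."""


# Illustrative subset of the state-changing commands listed above.
STATE_CHANGING = {"Replay", "Pause", "Drain", "Restart", "Eject"}


def handle_request(name, operations_enabled):
    """Refuse state-changing commands when gated; always allow reads."""
    if name in STATE_CHANGING and not operations_enabled:
        raise OperationsDisabled(f"{name} refused: Operations Enabled is off")
    return f"{name}: allowed"


# With the flag off, a read-only query still goes through...
read_result = handle_request("timeline", operations_enabled=False)

# ...while a state-changing command is refused.
try:
    handle_request("Replay", operations_enabled=False)
    replay_refused = False
except OperationsDisabled:
    replay_refused = True
```

The check sits in front of the send step, so a refused command never reaches the transport. That matches the troubleshooting table: a gated click leaves no trace on the service side.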

This is the recommended way to put the console into a "look but don't touch" mode for production change freezes. See Settings → Connection Settings.

Released under the MIT License.