Deploying CritterWatch as a cluster
CritterWatch is single-node by default — no extra configuration needed for small deployments. Switching to a horizontally-scaled cluster (2+ BFF nodes behind a load balancer) is opt-in and additive.
Default-on since #237.
AddCritterWatchServices(...) now turns on cluster partitioning by default; supply a configureClusterShardedTopology callback matching your transport mix. Pass enableClusterPartitioning: false to opt out (mostly relevant for integration tests that call DisableAllExternalWolverineTransports()). The producer side is wired symmetrically via AddCritterWatchMonitoring(..., configureShardedTopology: ...). The BFF still keeps the legacy single critterwatch listener for backwards compatibility with monitored services that haven't opted in to the producer-side hook. That listener is ListenOnlyAtLeader()-pinned, so a multi-node BFF routes legacy traffic through exactly one node.
What changes when you cluster
| Concern | Single node | Clustered |
|---|---|---|
| Per-service single-writer | local (in-process slots) | GlobalPartitioned distributes by service id; no two nodes process the same service's updates concurrently |
| Periodic alert evaluators / metrics scrapers | run in-process | publish a tick on every node; the matching LocalQueueFor<Tick>().ListenOnlyAtLeader() handler runs on the elected leader only — no duplicate alerts, scrapes, or Slack/email/webhook side effects |
| SignalR fan-out across browsers | in-process hub | Redis backplane fans out every server-pushed message to all connected clients across nodes |
| Marten async daemon | self-distributes via Wolverine-managed subscription distribution (unchanged) | same |
Enabling the Redis SignalR backplane
The backplane is config-driven: set a redis connection string and CritterWatchHostingExtensions.AddCritterWatch automatically chains AddStackExchangeRedis() onto the SignalR builder. Absent the connection string, single-node SignalR (the default) keeps working.
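As a rough illustration of what "config-driven" means here, the sketch below registers the Redis backplane only when a redis connection string is present. It is not CritterWatch's actual implementation: the conditional is an assumption about how AddCritterWatch might behave internally, and it presumes the Microsoft.AspNetCore.SignalR.StackExchangeRedis package is referenced.

```csharp
// Hypothetical sketch only; AddCritterWatch performs this wiring for you.
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Resolves ConnectionStrings:redis from appsettings.json or from the
// ConnectionStrings__redis environment variable.
var redis = builder.Configuration.GetConnectionString("redis");

var signalR = builder.Services.AddSignalR();
if (!string.IsNullOrWhiteSpace(redis))
{
    // Multi-node: server-pushed messages are relayed through Redis.
    signalR.AddStackExchangeRedis(redis);
}
// Otherwise SignalR stays in-process, which is the single-node default.

var app = builder.Build();
app.Run();
```

In a real host you would not write this yourself; it only shows why leaving the connection string unset keeps the single-node behaviour.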
Aspire (dev / test)
The Aspire BffHost declares a redis resource and references it from the BFF — Aspire injects ConnectionStrings__redis automatically, so a clustered dev run is dotnet run from src/BffHost with no extra knobs.
Docker Compose
docker-compose.yml ships a redis:7-alpine service on the default port. Hosts running outside Aspire set the connection string in appsettings.json (or via ConnectionStrings__redis):
{
  "ConnectionStrings": {
    "redis": "localhost:6379"
  }
}
Azure SignalR (opt-in, documented-and-supported)
Wolverine.SignalR uses the standard ASP.NET Core Hub / IHubContext, so any scale-out provider that hooks into the SignalR DI builder fans out cross-node with no broadcast-code changes. To use Azure SignalR Service instead of the Redis backplane, do not set the redis connection string and add Azure SignalR alongside AddCritterWatch:
builder.AddCritterWatch(connectionString);
builder.Services.AddSignalR().AddAzureSignalR();
Exactly one backplane per deployment. CritterWatch only ships the Redis integration out of the box; Azure SignalR is documented but not bundled or CI-exercised.
Enabling global partitioning (per-service single-writer)
The Redis backplane handles fan-out of outbound SignalR traffic. Global partitioning is the matching story on the inbound side: it guarantees that all updates for a given monitored service land on a single BFF node, cluster-wide, so two BFF nodes never race to project the same ServiceSummary aggregate. Since #237 it is on by default on the BFF side; single-node deployments don't need it and can pass enableClusterPartitioning: false to opt out.
Pick N = your expected BFF node count
The integer N you pass to UseSharded…Queues(...) is the partition count: that many physical sharded queues are declared on the transport, and Wolverine hashes each message's group id (the monitored service's ServiceName / Id) mod N to decide which slot it lands on. Each slot is owned by exactly one BFF node.
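To make the routing rule concrete, here is a minimal sketch of "hash the group id, then take it mod N". The FNV-1a hash and the SlotFor helper are stand-ins chosen for illustration; Wolverine's actual hash function is not described here and will likely produce different slot numbers. The service names are made up.

```csharp
using System;
using System.Text;

// Hypothetical partition count; must match the N passed to UseSharded…Queues.
const int PartitionCount = 5;

// Stand-in deterministic hash (32-bit FNV-1a). string.GetHashCode() is
// randomized per process in .NET, so it would not be stable across nodes.
static uint StableHash(string text)
{
    unchecked
    {
        uint hash = 2166136261;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            hash ^= b;
            hash *= 16777619;
        }
        return hash;
    }
}

// Same group id and same N always give the same slot, on every node.
static int SlotFor(string groupId, int partitionCount) =>
    (int)(StableHash(groupId) % (uint)partitionCount);

foreach (var service in new[] { "orders", "billing", "inventory" })
{
    Console.WriteLine($"{service} -> slot {SlotFor(service, PartitionCount)}");
}
```

The point of the sketch is the determinism: as long as producer and consumer agree on the hash and on N, every update for a given service lands on the same slot.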
Set N to the number of BFF nodes you expect to run. With N = 5 and 3 nodes, two nodes carry two slots each and one carries one — load skews slightly but every node has work. With N smaller than your node count, some BFF nodes sit idle (Wolverine assigns each slot to a single node). With N much larger than your node count, you pay overhead for queues you don't need.
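The arithmetic above can be checked with a short sketch. It assumes slots are dealt to nodes round-robin, which is only one way to get an even spread; Wolverine's actual assignment may order things differently, but the per-node counts come out the same for any even distribution. SlotsPerNode is a hypothetical helper.

```csharp
using System;
using System.Linq;

// Number of slots each node owns when `partitions` slots are spread as
// evenly as possible over `nodes` nodes (round-robin is an assumption).
static int[] SlotsPerNode(int partitions, int nodes) =>
    Enumerable.Range(0, nodes)
        .Select(node => Enumerable.Range(0, partitions).Count(slot => slot % nodes == node))
        .ToArray();

Console.WriteLine(string.Join(", ", SlotsPerNode(5, 3))); // 2, 2, 1
Console.WriteLine(string.Join(", ", SlotsPerNode(2, 3))); // 1, 1, 0 (one node idle)
```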
N must agree exactly between the producer side and the consumer side — mismatched hashes route to slots the consumer isn't listening on. Pick one value and centralise it (a shared constant, environment variable, or the service-handshake mechanism the BFF already uses for capability negotiation).
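A small sketch of why a mismatch hurts, assuming slots are numbered 0 to N-1 (the numbering scheme and the two values are hypothetical): if the producer hashes over more slots than the BFF listens on, the extra slots receive traffic that nobody consumes.

```csharp
using System;
using System.Linq;

// Hypothetical misconfiguration: producers deployed with N = 5, BFF with N = 3.
const int producerN = 5;
const int consumerN = 3;

// Slots a producer may publish to that no BFF node is listening on.
int[] orphaned = Enumerable.Range(0, producerN)
    .Except(Enumerable.Range(0, consumerN))
    .ToArray();

Console.WriteLine(orphaned.Length == 0
    ? "N agrees on both sides"
    : $"Unlistened slots: {string.Join(", ", orphaned)}");
// prints "Unlistened slots: 3, 4"
```

Keeping N in one shared constant or environment variable, read by both sides, removes this class of error.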
Wire the consumer side (BFF)
opts.AddCritterWatchServices(
    NpgsqlDataSource.Create(connectionString),
    // enableClusterPartitioning defaults to true (#237); listed here just
    // for completeness. Pass false to opt out.
    configureClusterShardedTopology: topology =>
    {
        // Mix and match per the transports this BFF actually uses.
        topology.UseShardedRabbitQueues("critterwatch", 5);
        // topology.UseShardedAmazonSqsQueues("critterwatch", 5);
        // topology.UseShardedAzureServiceBusQueues("critterwatch", 5);
    });
Without the callback, AddCritterWatchServices throws ArgumentNullException("configureClusterShardedTopology", "… UseShardedRabbitQueues …"). Wolverine's GlobalPartitionedMessageTopology.AssertValidity() requires that a sharded external topology be registered at the same time as the message subscription, and a parameter-anchored error makes the missing argument obvious instead of surfacing Wolverine's deeper "external transport topology must be configured" message.
Wire the producer side (every monitored service)
opts.AddCritterWatchMonitoring(
    critterWatchUri: new Uri("rabbitmq://queue/critterwatch"),
    systemControlUri: new Uri("rabbitmq://queue/my_service_control"),
    configureShardedTopology: topology =>
    {
        // Same value of N. Same transport-specific call.
        topology.UseShardedRabbitQueues("critterwatch", 5);
    });
Azure Service Bus variant
The same shape applies with UseShardedAzureServiceBusQueues. A sample BFF and producer pair:
// BFF
opts.AddCritterWatchServices(
    NpgsqlDataSource.Create(connectionString),
    configureClusterShardedTopology: topology =>
    {
        topology.UseShardedAzureServiceBusQueues("critterwatch", 5);
    });
// Each monitored service
opts.AddCritterWatchMonitoring(
    critterWatchUri: new Uri("azureservicebus://queue/critterwatch"),
    systemControlUri: new Uri("azureservicebus://queue/my_service_control"),
    configureShardedTopology: topology =>
    {
        topology.UseShardedAzureServiceBusQueues("critterwatch", 5);
    });
The N-matching constraint is identical to the RabbitMQ case (5 here must match on both sides). The same rollout-order rule applies (BFF first, then monitored services), so the consumer is on the matching N before any producer starts publishing onto the sharded slots.
UseShardedAmazonSqsQueues follows the same pattern. The transport-specific call resolves the shard naming and provisioning to whatever the underlying broker convention is (queue per shard on RabbitMQ / SQS, subscription per shard on ASB).
Once both sides ship
ICritterWatchMessage traffic (ServiceUpdates, AgentHealthReport, ShardStatesChanged, …) flows over the sharded slots. Heartbeats (WolverineHeartbeat) and MessageHandlingMetrics keep flowing over the unsharded critterwatch URI you've always passed — the BFF deliberately doesn't shard those, and a sharded slot with no listener would dead-letter them.
What if I only ship one side?
The producer and consumer hooks are independent rollouts. Both have a default-off path so half-finished migrations are graceful:
| Producer | Consumer | What happens |
|---|---|---|
| sharded | sharded | Full per-service single-writer. Recommended for multi-BFF deployments. |
| sharded | single (default) | Producer's ICritterWatchMessage lands on the sharded slots but no BFF is listening on them — messages stall on the broker. Don't roll out the producer side until the BFF is on the matching N. |
| single (default) | sharded | BFF still listens on the legacy single critterwatch queue alongside the sharded slots. Older monitored services keep working untouched. Roll out the consumer side first. |
| single (default) | single (default) | Single-queue legacy path. The BFF's listener is ListenOnlyAtLeader()-pinned (see below) so multi-node BFFs don't split-brain on it. |
Legacy single-queue listener is leader-pinned
Even without partitioning, the BFF's ListenToRabbitQueue("critterwatch") and ListenToSqsQueue("critterwatch") call .ListenOnlyAtLeader(). In a single-node deployment that's identical to the pre-leader-aware default (the sole node is the leader). In a multi-node deployment, only one node consumes the legacy queue at a time — preventing the optimistic-concurrency retry storms and split-brain ServiceSummary processing that competing consumers on a single queue would otherwise cause. The sharded slots stay leader-agnostic; only this back-compat queue is leader-pinned.
Load balancer requirements
Health endpoints are LB-appropriate (each node serves /health); the boot-smoke CI gate (#216) asserts the same endpoint reports Healthy.
No sticky sessions required. The Redis backplane fans every SignalR send to every node, so a client that connects to node B receives updates produced on node A. The same property holds for Azure SignalR. Configure the LB for plain round-robin (or least-connections) over WebSocket — sticky sessions add no value and can mask backplane misconfiguration.
Cluster correctness audit (#217)
The 7 BackgroundServices in CritterWatch.Services are classified as follows:
| Service | Classification | Why |
|---|---|---|
| MetricsAlertEvaluator | cluster-singleton | persists alert records + publishes lifecycle messages |
| ProjectionAlertEvaluator | cluster-singleton | same shape |
| PrometheusScrapingService | cluster-singleton | external HTTP fetch + persistence |
| MetricsIdleReEvaluator | cluster-singleton | re-publishes rollups |
| StateRefreshService | per-node OK | refreshes its own connected clients; backplane fans out |
| AlertBatchAccumulator | per-node OK | batches what this node received; backplane fans out |
| SignalRBatchAccumulator | per-node OK | same shape |
