Deploying CritterWatch as a cluster

CritterWatch is single-node by default — no extra configuration needed for small deployments. Switching to a horizontally-scaled cluster (2+ BFF nodes behind a load balancer) is opt-in and additive.

Cluster partitioning has been on by default since #237: AddCritterWatchServices(...) enables it unless you pass enableClusterPartitioning: false, which is mostly relevant for integration tests that call DisableAllExternalWolverineTransports(). With partitioning on, supply a configureClusterShardedTopology callback that matches your transport mix. The producer side is wired symmetrically through AddCritterWatchMonitoring(..., configureShardedTopology: ...). For backwards compatibility with monitored services that haven't opted in to the producer-side hook, the BFF keeps the legacy single critterwatch listener. That listener is pinned with ListenOnlyAtLeader(), so a multi-node BFF routes legacy traffic through exactly one node.

What changes when you cluster

| Concern | Single node | Clustered |
| --- | --- | --- |
| Per-service single-writer | local (in-process slots) | `GlobalPartitioned` distributes by service id; no two nodes process the same service's updates concurrently |
| Periodic alert evaluators / metrics scrapers | run in-process | publish a tick on every node; the matching `LocalQueueFor<Tick>().ListenOnlyAtLeader()` handler runs on the elected leader only, so there are no duplicate alerts, scrapes, or Slack/email/webhook side effects |
| SignalR fan-out across browsers | in-process hub | Redis backplane fans out every server-pushed message to all connected clients across nodes |
| Marten async daemon | self-distributes via Wolverine-managed subscription distribution (unchanged) | same |

Enabling the Redis SignalR backplane

The backplane is config-driven: set a redis connection string and CritterWatchHostingExtensions.AddCritterWatch automatically chains AddStackExchangeRedis() onto the SignalR builder. Absent the connection string, single-node SignalR (the default) keeps working.

Aspire (dev / test)

The Aspire BffHost declares a redis resource and references it from the BFF — Aspire injects ConnectionStrings__redis automatically, so a clustered dev run is dotnet run from src/BffHost with no extra knobs.

Docker Compose

docker-compose.yml ships a redis:7-alpine service on the default port. Hosts running outside Aspire set the connection string in appsettings.json or through the ConnectionStrings__redis environment variable:

jsonc

{
  "ConnectionStrings": {
    "redis": "localhost:6379"
  }
}

Azure SignalR (opt-in, documented-and-supported)

Wolverine.SignalR uses the standard ASP.NET Core Hub / IHubContext, so any scale-out provider that hooks into the SignalR DI builder fans out cross-node with no broadcast-code changes. To use Azure SignalR Service instead of the Redis backplane, do not set the redis connection string and add Azure SignalR alongside AddCritterWatch:

csharp

builder.AddCritterWatch(connectionString);
builder.Services.AddSignalR().AddAzureSignalR();

Use exactly one backplane per deployment. CritterWatch ships only the Redis integration out of the box; Azure SignalR is documented but not bundled or CI-exercised.

Enabling global partitioning (per-service single-writer)

The Redis backplane handles fan-out of outbound SignalR traffic. Global partitioning is the matching story on the inbound side: it guarantees that all updates for a given monitored service land on a single BFF node, cluster-wide, so two BFF nodes never race to project the same ServiceSummary aggregate. Since #237 it is on by default in AddCritterWatchServices; single-node deployments don't need it and can pass enableClusterPartitioning: false to opt out.

Pick `N` = your expected BFF node count

The integer N you pass to UseSharded…Queues(...) is the partition count: that many physical sharded queues are declared on the transport, and Wolverine hashes each message's group id (the monitored service's ServiceName / Id) mod N to decide which slot it lands on. Each slot is owned by exactly one BFF node.
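To make the "hash mod N" step concrete, here is a minimal sketch of how a group id could map onto a slot. The SlotFor helper and the FNV-1a hash are assumptions chosen for illustration; the source does not say which hash function Wolverine uses, so the slot numbers this prints will not necessarily match a real deployment.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for slot selection: FNV-1a over the UTF-8 bytes of
// the group id, reduced mod N. Wolverine's real hash function may differ.
static int SlotFor(string groupId, int partitionCount)
{
    if (partitionCount <= 0)
        throw new ArgumentOutOfRangeException(nameof(partitionCount));

    uint hash = 2166136261;          // FNV-1a 32-bit offset basis
    foreach (byte b in Encoding.UTF8.GetBytes(groupId))
    {
        hash ^= b;
        hash *= 16777619;            // FNV-1a 32-bit prime (wraps on overflow)
    }
    return (int)(hash % (uint)partitionCount);
}

// The same service name always lands on the same slot for a fixed N.
foreach (var service in new[] { "orders", "billing", "inventory" })
{
    Console.WriteLine($"{service} -> slot {SlotFor(service, 5)}");
}
```

The property that matters is determinism: for a fixed N, every message carrying the same group id resolves to the same slot, which is what lets a single node own all updates for that service.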

Set N to the number of BFF nodes you expect to run. With N = 5 and 3 nodes, two nodes carry two slots each and one carries one — load skews slightly but every node has work. With N smaller than your node count, some BFF nodes sit idle (Wolverine assigns each slot to a single node). With N much larger than your node count, you pay overhead for queues you don't need.
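The arithmetic in the paragraph above can be checked with a small sketch. It assumes slots are spread round-robin over the running nodes; the source only states that each slot is owned by exactly one node, so the SlotsPerNode helper is hypothetical and Wolverine's actual assignment may order things differently.

```csharp
using System;

// Hypothetical even spread of N slots over the running nodes (round-robin).
// Only the "one owner per slot" rule comes from the text; the assignment
// order is an assumption made for this sketch.
static int[] SlotsPerNode(int partitionCount, int nodeCount)
{
    var counts = new int[nodeCount];
    for (var slot = 0; slot < partitionCount; slot++)
        counts[slot % nodeCount]++;
    return counts;
}

Console.WriteLine(string.Join(",", SlotsPerNode(5, 3))); // 2,2,1
Console.WriteLine(string.Join(",", SlotsPerNode(2, 3))); // 1,1,0 (one node idle)
Console.WriteLine(string.Join(",", SlotsPerNode(3, 3))); // 1,1,1
```

The second line shows the idle-node case: with fewer slots than nodes, at least one node owns nothing.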

N must agree exactly between the producer side and the consumer side: if the values differ, producers hash messages onto slots the consumer isn't listening on. Pick one value and centralise it (a shared constant, an environment variable, or the service-handshake mechanism the BFF already uses for capability negotiation).
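One way to centralise the value is to resolve it from a single environment variable in both the BFF and every monitored service. The sketch below assumes a variable named CRITTERWATCH_SHARD_COUNT and a ResolvePartitionCount helper; both are made up for illustration and are not part of CritterWatch.

```csharp
using System;

// Hypothetical helper: reads the shared partition count from an environment
// variable so both sides resolve the same N. The variable name is invented
// for this sketch.
static int ResolvePartitionCount(string variable = "CRITTERWATCH_SHARD_COUNT")
{
    var raw = Environment.GetEnvironmentVariable(variable);
    if (!int.TryParse(raw, out var n) || n < 1)
    {
        var shown = raw ?? "<unset>";
        throw new InvalidOperationException(
            $"{variable} must be a positive integer, got '{shown}'.");
    }
    return n;
}

// Demo only: a real deployment would set this in the host environment.
Environment.SetEnvironmentVariable("CRITTERWATCH_SHARD_COUNT", "5");

var partitionCount = ResolvePartitionCount();
Console.WriteLine($"Using N = {partitionCount} on both sides");
// Both topology callbacks would then pass partitionCount instead of a literal.
```

Failing at startup when the variable is missing or invalid is a deliberate choice in this sketch: a silent fallback on one side would reintroduce exactly the mismatch the shared value is meant to prevent.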

Wire the consumer side (BFF)

csharp

opts.AddCritterWatchServices(
    NpgsqlDataSource.Create(connectionString),
    // enableClusterPartitioning defaults to true (#237) — listed here just
    // for completeness. Pass false to opt out.
    configureClusterShardedTopology: topology =>
    {
        // Mix and match per the transports this BFF actually uses.
        topology.UseShardedRabbitQueues("critterwatch", 5);
        // topology.UseShardedAmazonSqsQueues("critterwatch", 5);
        // topology.UseShardedAzureServiceBusQueues("critterwatch", 5);
    });

Without the callback, AddCritterWatchServices throws ArgumentNullException("configureClusterShardedTopology", "… UseShardedRabbitQueues …"). Wolverine's GlobalPartitionedMessageTopology.AssertValidity() requires a sharded external topology to be registered at the same time as the message subscription. A parameter-anchored error makes the missing argument obvious, instead of letting Wolverine's deeper "external transport topology must be configured" message bubble up.

Wire the producer side (every monitored service)

csharp

opts.AddCritterWatchMonitoring(
    critterWatchUri: new Uri("rabbitmq://queue/critterwatch"),
    systemControlUri: new Uri("rabbitmq://queue/my_service_control"),
    configureShardedTopology: topology =>
    {
        // Same value of N. Same transport-specific call.
        topology.UseShardedRabbitQueues("critterwatch", 5);
    });

Azure Service Bus variant

The same shape with UseShardedAzureServiceBusQueues — sample BFF + producer pair:

csharp

// BFF
opts.AddCritterWatchServices(
    NpgsqlDataSource.Create(connectionString),
    configureClusterShardedTopology: topology =>
    {
        topology.UseShardedAzureServiceBusQueues("critterwatch", 5);
    });

// Each monitored service
opts.AddCritterWatchMonitoring(
    critterWatchUri: new Uri("azureservicebus://queue/critterwatch"),
    systemControlUri: new Uri("azureservicebus://queue/my_service_control"),
    configureShardedTopology: topology =>
    {
        topology.UseShardedAzureServiceBusQueues("critterwatch", 5);
    });

The N-matching constraint is identical to the RabbitMQ case (5 here must match on both sides). The same rollout-order rule applies — BFF first, then monitored services — so the consumer is on the matching N before any producer starts publishing onto the sharded slots.

UseShardedAmazonSqsQueues follows the same pattern. The transport-specific call resolves the shard naming and provisioning to whatever the underlying broker convention is (queue per shard on RabbitMQ / SQS, subscription per shard on ASB).

Once both sides ship

ICritterWatchMessage traffic (ServiceUpdates, AgentHealthReport, ShardStatesChanged, …) flows over the sharded slots. Heartbeats (WolverineHeartbeat) and MessageHandlingMetrics keep flowing over the unsharded critterwatch URI you've always passed — the BFF deliberately doesn't shard those, and a sharded slot with no listener would dead-letter them.

What if I only ship one side?

The producer and consumer hooks roll out independently, and each side can still run unsharded, so a half-finished migration is graceful as long as the BFF moves first. Since #237 the BFF is sharded by default, so "single" on the consumer side means enableClusterPartitioning: false:

| Producer | Consumer | What happens |
| --- | --- | --- |
| sharded | sharded | Full per-service single-writer. Recommended for multi-BFF deployments. |
| sharded | single (`enableClusterPartitioning: false`) | Producer's `ICritterWatchMessage` lands on the sharded slots but no BFF is listening on them, so messages stall on the broker. Don't roll out the producer side until the BFF is on the matching `N`. |
| single (default) | sharded | BFF still listens on the legacy single `critterwatch` queue alongside the sharded slots. Older monitored services keep working untouched. Roll out the consumer side first. |
| single (default) | single (`enableClusterPartitioning: false`) | Single-queue legacy path. The BFF's listener is `ListenOnlyAtLeader()`-pinned (see below) so multi-node BFFs don't split-brain on it. |
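The four rows of the table above can be encoded as a small check, for example in a deployment script that refuses an unsafe rollout step. This is only a sketch: RolloutOutcome and IsSafeStep are hypothetical names, and nothing here calls into CritterWatch or Wolverine.

```csharp
using System;

// Encodes the four producer/consumer combinations from the table.
// Purely illustrative; these helpers are not part of CritterWatch.
static string RolloutOutcome(bool producerSharded, bool consumerSharded) =>
    (producerSharded, consumerSharded) switch
    {
        (true, true)   => "ok: full per-service single-writer",
        (true, false)  => "unsafe: sharded messages stall, no BFF listener",
        (false, true)  => "ok: BFF still serves the legacy critterwatch queue",
        (false, false) => "ok: legacy single-queue path, leader-pinned",
    };

// A step is safe unless producers shard before the BFF does.
static bool IsSafeStep(bool producerSharded, bool consumerSharded) =>
    RolloutOutcome(producerSharded, consumerSharded)
        .StartsWith("ok", StringComparison.Ordinal);

Console.WriteLine(RolloutOutcome(producerSharded: true, consumerSharded: false));
Console.WriteLine(IsSafeStep(producerSharded: false, consumerSharded: true));
```

The only combination flagged as unsafe is a sharded producer with an unsharded consumer, which is why the BFF goes first.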

Legacy single-queue listener is leader-pinned

Even without partitioning, the BFF's ListenToRabbitQueue("critterwatch") and ListenToSqsQueue("critterwatch") call .ListenOnlyAtLeader(). In a single-node deployment that's identical to the pre-leader-aware default (the sole node is the leader). In a multi-node deployment, only one node consumes the legacy queue at a time — preventing the optimistic-concurrency retry storms and split-brain ServiceSummary processing that competing consumers on a single queue would otherwise cause. The sharded slots stay leader-agnostic; only this back-compat queue is leader-pinned.

Load balancer requirements

Each node serves /health, which makes the health endpoint suitable as a load balancer probe; the boot-smoke CI gate (#216) asserts that the same endpoint reports Healthy.

No sticky sessions required. The Redis backplane fans every SignalR send to every node, so a client that connects to node B receives updates produced on node A. The same property holds for Azure SignalR. Configure the LB for plain round-robin (or least-connections) over WebSocket — sticky sessions add no value and can mask backplane misconfiguration.

Cluster correctness audit (#217)

The 7 BackgroundServices in CritterWatch.Services are classified as follows:

| Service | Classification | Why |
| --- | --- | --- |
| `MetricsAlertEvaluator` | cluster-singleton | persists alert records + publishes lifecycle messages |
| `ProjectionAlertEvaluator` | cluster-singleton | same shape |
| `PrometheusScrapingService` | cluster-singleton | external HTTP fetch + persistence |
| `MetricsIdleReEvaluator` | cluster-singleton | re-publishes rollups |
| `StateRefreshService` | per-node OK | refreshes its own connected clients; backplane fans out |
| `AlertBatchAccumulator` | per-node OK | batches what this node received; backplane fans out |
| `SignalRBatchAccumulator` | per-node OK | same shape |
