Manual Test Plan — Long-running projection rebuilds (priority)

The priority plan for #314. Exercises projection rebuilds at durations of minutes to hours under realistic publisher traffic, with operator-visible failure-mode tests for the mid-flight cases CritterWatch users actually hit in production.

Design partner: #309 — Long-running projection rebuild orchestration. Cross-references inline below.

Substrate

TeleHealth is the long-running substrate (Marten.ScaleTesting-derived).

```bash
# Default scale: fast iteration (~30s rebuild)
dotnet run --project src/BffHost  # Full scenario

# Scaled up via CLI flags on the TeleHealthPublisher resource
# (5 tenants × 50k events = 250k events default; tune via env vars)
export TELEHEALTH_TENANTS=5
export TELEHEALTH_EVENTS_PER_TENANT=50000

# Upstream max (matches Marten.ScaleTesting)
export TELEHEALTH_TENANTS=50
export TELEHEALTH_EVENTS_PER_TENANT=400000
```

TeleHealth.Publisher seeds the events once, then stays alive but idle, so the rebuild runs against a stable event store with no concurrent writes. This isolates the rebuild duration from publisher pressure. Re-running the test with the publisher active (the `--continuous` flag; see TeleHealth.Publisher/EventSeeder.cs) is its own scenario, called out where relevant below.

TelehealthComposite is the projection under test: a 4 + 2 + 2 multi-stream composite that mirrors Marten.ScaleTesting's Stage-2 + Stage-3 split, so a rebuild touches every event in the store.
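
As a sanity check on the scale numbers above, the expected seeded event count is tenants × events per tenant. The following is a minimal sketch, assuming the seeder writes exactly that many events and that the two env vars are the only scale inputs. `expected_events` is a hypothetical helper name, not something that exists in the repo.

```bash
# Hypothetical helper: expected seeded event count for a given scale.
# Arguments win; otherwise it falls back to the env vars from this plan,
# and finally to the default scale (5 tenants x 50000 events).
expected_events() {
  local tenants="${1:-${TELEHEALTH_TENANTS:-5}}"
  local per_tenant="${2:-${TELEHEALTH_EVENTS_PER_TENANT:-50000}}"
  echo $(( tenants * per_tenant ))
}
```

`expected_events 50 400000` prints 20000000, the figure to compare against the HWM once seeding finishes at upstream max scale.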

LR-1 — Baseline rebuild at default scale

Field	Value
Setup	Full scenario at default scale (5 tenants × 50k events). Wait for `TeleHealthPublisher` to finish seeding (its `BackgroundService` logs `EventSeeder: seeded N events`). Confirm `TeleHealth.Service` shows `TelehealthComposite` at `Updated` with the full 250k HWM.
Action	Trigger Rebuild on `TelehealthComposite` from the Projection Detail page.
Expected observation	The shard's sequence drops to 0 and starts climbing. The state badge cycles `Updated → Stopped → Running`, then stays `Running` for the duration. Sparkline shows a smooth monotonic climb. Rebuild completes back at the seeded HWM in roughly 30–90 seconds at default scale on a dev rig. Heartbeat stays green throughout.
How to verify	UI: state badge transitions. SQL: `SELECT last_seq_id, agent_status, last_heartbeat FROM telehealth.mt_event_progression WHERE name LIKE 'TelehealthComposite:%';` — `last_heartbeat` advances every poll interval (~5s), `last_seq_id` climbs monotonically, `agent_status` stays `Running` (or transitions through expected rebuild states). API: `service.shardStates[<shardName>].lastHeartbeat` advances; `.sequence` climbs.
Why it matters	Baseline duration + heartbeat liveness across a full rebuild is the operator's "is it making progress" signal. The HWM never freezes if the rebuild is healthy — but if it does, the HWM Frozen alert from #150 signal 1 fires after 30s.
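
The monotonic-climb check can be scripted by sampling `last_seq_id` during the rebuild and scanning the samples afterwards. This is a minimal sketch: `check_monotonic` is a hypothetical helper, and the input is assumed to be one integer sample per line, oldest first, for a single shard.

```bash
# Hypothetical helper: reads one last_seq_id sample per line on stdin (oldest first).
# Prints "monotonic" if no sample is lower than the one before it,
# otherwise prints the first regression it finds and returns 1.
check_monotonic() {
  local prev="" cur
  while read -r cur || [ -n "$cur" ]; do
    [ -z "$cur" ] && continue            # skip blank lines
    if [ -n "$prev" ] && [ "$cur" -lt "$prev" ]; then
      echo "regression: $prev -> $cur"
      return 1
    fi
    prev="$cur"
  done
  echo "monotonic"
}
```

Feed it the samples taken after the sequence has dropped to 0, for example the SQL above narrowed to one shard `name` and collected once per poll interval. Repeated values are accepted, since two polls can land on the same sequence.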

LR-2 — Rebuild at upstream max scale (50 × 400k = 20M events)

Field	Value
Setup	Tune env vars before boot: `TELEHEALTH_TENANTS=50`, `TELEHEALTH_EVENTS_PER_TENANT=400000`. Allow the seed to complete (~10–20 min depending on disk).
Action	Trigger Rebuild on `TelehealthComposite`. Capture the wall-clock start time.
Expected observation	Same shape as LR-1, but the rebuild lasts minutes to tens of minutes. The Projection Detail page's sparkline must remain responsive while the rebuild runs; the Last Advanced timestamp updates every poll cycle. The state badge stays `Running` (never silently flips to `Stale`).
How to verify	The gap between the current wall-clock time and the UI's Last Advanced timestamp should stay below the polling interval at all times. SQL spot-check during the run: `SELECT last_seq_id, last_heartbeat, EXTRACT(EPOCH FROM (now() - last_heartbeat)) AS seconds_since_heartbeat FROM telehealth.mt_event_progression WHERE name LIKE 'TelehealthComposite:%';` `seconds_since_heartbeat` should stay below the poll interval throughout.
#309 cross-ref	The orchestration design discussion in #309 calls out the operator visibility gap during long rebuilds: no % progress, no ETA, and no way to cancel from the UI. Until #309 lands those affordances, this test verifies the current minimum: the liveness signal and sequence advancement are observable.
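
The `seconds_since_heartbeat` spot-check lends itself to a small filter. This is a sketch under two assumptions: the query is extended to also select `name`, and it is run through `psql -At` so that rows arrive as `name|seconds`. `stale_shards` is a hypothetical helper.

```bash
# Hypothetical helper: reads "name|seconds_since_heartbeat" rows on stdin
# and prints the name of every shard whose heartbeat age is at or above
# the poll interval (in seconds) passed as the first argument.
stale_shards() {
  local poll_seconds="$1"
  awk -F'|' -v p="$poll_seconds" '($2 + 0) >= (p + 0) { print $1 }'
}
```

Empty output means every shard's heartbeat is fresher than the poll interval. Any printed shard name is a liveness failure for this scenario.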

LR-3 — Mid-flight pause-then-restart on a long rebuild

Field	Value
Setup	Default scale; trigger Rebuild on `TelehealthComposite` per LR-1. Wait until the projection sequence is ~30–60% of seeded HWM (rough mid-flight).
Action	Click Pause on the projection detail page. Wait 5 seconds. Click Restart.
Expected observation	The state badge changes Rebuilding → Paused on the pause click within ~1s. The sequence freezes at the mid-flight value. After the Restart click the rebuild resumes from the paused sequence, not from 0 (no double-rebuild). State returns to Running. The sparkline shows a flat segment during the pause then resumes the climb.
How to verify	SQL: `SELECT last_seq_id, agent_status FROM telehealth.mt_event_progression WHERE name LIKE 'TelehealthComposite:%';` — at pause: `agent_status = Paused`, `last_seq_id` frozen. After restart: `agent_status = Running`, `last_seq_id` resumes climbing from the same point. The post-restart sequence should equal the pre-pause sequence (off by < 1 polling cycle's worth of events).
Why it matters	Pausing a long rebuild is a real operator move — they want to confirm sustainable load before letting it run to completion. Resume must not restart from scratch (which would silently double the rebuild time).
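
The resume-from-pause comparison can be reduced to three outcomes. This is a minimal sketch: `resume_verdict` and its verdict strings are hypothetical, and the tolerance is whatever the tester estimates one polling cycle's worth of events to be on their rig.

```bash
# Hypothetical helper: compares the sequence captured just before Pause
# with the first sequence observed after Restart.
# $1 = pre-pause last_seq_id, $2 = post-restart last_seq_id,
# $3 = tolerance in events (roughly one polling cycle's worth; must be > 0).
resume_verdict() {
  local pre="$1" post="$2" tolerance="$3"
  if [ "$post" -lt "$pre" ]; then
    echo "regressed"      # sequence went backwards, e.g. the rebuild restarted from 0
  elif [ $(( post - pre )) -lt "$tolerance" ]; then
    echo "resumed"        # picked up where it paused
  else
    echo "drifted"        # advanced further than one polling cycle explains
  fi
}
```

Take the post-restart sample within one polling cycle of clicking Restart. A later sample will read as `drifted` simply because the rebuild has kept going.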

LR-4 — Process kill mid-rebuild → daemon recovery + CritterWatch reconnect

Field	Value
Setup	Default scale; trigger Rebuild per LR-1. Wait until ~30–60% mid-flight.
Action	Kill the `TeleHealth.Service` process from the Aspire dashboard (Stop) or via `pkill -f TeleHealth.Service`. Wait 10 seconds. Restart the service from the Aspire dashboard.
Expected observation	While the service is down: the CritterWatch UI's heartbeat dot for `TeleHealthService` goes amber within ~60s (one missed beat) then red after ~150s (five missed beats). The Projection Detail page surfaces the stale state — last-advanced timestamp ages, state pill turns warning-colored. On restart: heartbeat returns to green; the daemon resumes the in-progress rebuild from the persisted progression row (not from 0); CritterWatch picks up the in-progress shard state within ~one polling cycle (~5s) of the daemon publishing it.
How to verify	SQL during the down window: `last_seq_id` is whatever it was at kill time (frozen). After restart: `last_seq_id` resumes climbing without resetting. The CritterWatch UI's `Last Advanced` timestamp re-anchors to the post-restart polling time. The pre-existing test `Tests.Integration.heartbeats_via_signalr` covers heartbeat liveness in isolation; this scenario adds the rebuild-state-preserved gate on top.
Why it matters	This is the production failure mode — a node bounces, an Aspire restart happens, a deploy lands. The rebuild needs to survive without operator-visible double-counting or silent regression.

LR-5 — Force a projection apply error during rebuild (stop-on-error policy)

Field	Value
Setup	Boot the Incidents sample (not TeleHealth — Incidents is the stop-on-error substrate landed in #316). `dotnet run --project src/Samples/Incidents/Incidents.Service` (or via Aspire). Wait for the publisher to drive a handful of incidents into the store.
Action	Inject a poison event by writing directly into the Incidents event store with a malformed payload (or via the Incidents publisher's `--bad-payload` flag if it exists; otherwise use the `ChaosMonkey` UI's Set Projection Failure Rate to 100% for the affected projection). Trigger Rebuild on `IncidentsByCategory`.
Expected observation	The shard halts on the first apply error — `agent_status` transitions to `Stopped`. The Projection Detail page surfaces: state badge → `Stopped`, Apply Errors card shows the PS#3 stop-on-error rendering (no DLQ button, restart/rebuild guidance), the Errors card surfaces the actual `ApplyException` with stack trace + offending event sequence. No silent skip — the rebuild stops, full stop.
How to verify	UI: `data-testid="apply-error-policy-stop"` is visible (PS#3 selector). The errors card lists the exception text. SQL: `SELECT name, agent_status, last_exception_message FROM incidents.mt_event_progression WHERE name LIKE 'IncidentsByCategory:%';` — `agent_status = Stopped`, `last_exception_message` populated. The PS#3 PR-A frontend test covers the rendering shape; this scenario gates the rebuild-time application of the policy.
Why it matters	Stop-on-error is the reporter's actual config in PS#3 — a misleading "View Related Dead Letters" button would have sent them down a rabbit hole. The fix is in (PS#3 PR-A); this scenario keeps the regression gated.

LR-6 — Cluster failover during a long rebuild

Field	Value
Setup	Full scenario with two BffHost replicas (or two `TeleHealth.Service` replicas — leader-elected projection coordinator picks one of them). Confirm both are `Running` in the Aspire dashboard and the Projection Detail page identifies which node is currently running the projection (Running On Node column / shard state's `RunningOnNode`). Trigger Rebuild per LR-1.
Action	Mid-rebuild, kill the leader node (`pkill -f` the specific instance — the one whose ID matches `RunningOnNode`).
Expected observation	Within 1–2 polling cycles (15–30s) the projection coordinator on the surviving node picks up the work. The Projection Detail page surfaces the new `RunningOnNode` value. The rebuild continues from the persisted progression row, not from 0. CritterWatch's shard-state stream reflects both the brief `Stopped` gap during failover and the resumption on the new node.
How to verify	SQL during failover: `running_on_node` column flips from the killed node's ID to the surviving one. `last_seq_id` continues climbing (modulo the failover gap). The pre-existing `Tests.Integration.MultiTenancy.single_server_provision_on_demand_tests` covers cluster behaviour broadly; this scenario adds the rebuild-state-preserved gate.
Why it matters	Two-node Aspire setups are the local-dev rehearsal for production HA. A leader bounce during rebuild must not restart the rebuild — that would silently double or triple the rebuild duration depending on how often it happens.
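
The failover check compares two snapshots of the same shard row. This is a sketch: `failover_verdict` and its verdict strings are hypothetical, and the inputs are assumed to be the `running_on_node` and `last_seq_id` values read before the kill and after the surviving node has taken over.

```bash
# Hypothetical helper: compares a snapshot taken before the leader kill with
# one taken after the surviving node picks up the shard.
# $1/$2 = running_on_node and last_seq_id before, $3/$4 = the same after.
failover_verdict() {
  local node_before="$1" seq_before="$2" node_after="$3" seq_after="$4"
  if [ "$node_after" = "$node_before" ]; then
    echo "no-failover"    # shard is still assigned to the original node
  elif [ "$seq_after" -lt "$seq_before" ]; then
    echo "restarted"      # new node, but the sequence went backwards
  else
    echo "resumed"        # new node continued from the persisted progression
  fi
}
```

Only `resumed` passes this scenario. `no-failover` usually means the wrong instance was killed, so re-check `RunningOnNode` before retrying.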

LR-7 — Rebuild a single tenant's projection without disturbing others

Field	Value
Setup	TeleHealth at default scale. Confirm all 5 tenants have an established sequence under `TelehealthComposite`. (TeleHealth uses conjoined multi-tenancy plus 8-bucket hash partitioning; the shard surface is `TelehealthComposite:All:tenant_<n>` per tenant.)
Action	Trigger Rebuild scoped to `tenant_0001` only (via the per-tenant rebuild affordance on the Projection Detail page; the Rebuild modal's tenant picker).
Expected observation	Only `tenant_0001`'s shard resets its sequence to 0 and re-advances. The other 4 tenants' shards stay at their current sequence with no interruption. The Projections page's per-tenant view selector lets the operator confirm visually.
How to verify	SQL: `SELECT name, last_seq_id, agent_status FROM telehealth.mt_event_progression WHERE name LIKE 'TelehealthComposite:%';` — exactly one row's `last_seq_id` resets and climbs; the rest stay frozen at their pre-rebuild values. The pre-existing `Tests.Integration.MultiTenancy.PerTenantDaemonRebuildTests` covers this at the daemon layer (currently parked per the project's task #303 — restore once unblocked); this scenario adds the UI gate.
Why it matters	Per-tenant rebuild is the only acceptable rebuild path for a 50-tenant store at scale — full-store rebuild would take hours and disrupt every tenant. Reporter scenarios in PS-class issues lean on this; the affordance has to work end-to-end through the BFF UI.
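
The "exactly one row changes" check can be done by diffing two snapshots of the progression table. This is a minimal sketch: `changed_shards` is a hypothetical helper, and each snapshot file is assumed to hold `name|last_seq_id` rows, for example the SQL above limited to those two columns and run through `psql -At`.

```bash
# Hypothetical helper: compares two "name|last_seq_id" snapshots
# ($1 = file captured before the rebuild, $2 = file captured during it)
# and prints the name of every shard whose sequence differs between them.
changed_shards() {
  awk -F'|' '
    NR == FNR { before[$1] = $2; next }              # first file: remember each shard
    ($1 in before) && before[$1] != $2 { print $1 }  # second file: report differences
  ' "$1" "$2"
}
```

With the publisher idle, the output should be the single shard for the tenant being rebuilt. Any other shard name in the output means the rebuild disturbed a tenant it should not have touched.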

LR-8 — Rebuild with concurrent publisher activity

Field	Value
Setup	TeleHealth at default scale, but start the publisher in `--continuous` mode (CLI flag on `TeleHealth.Publisher`) so it keeps appending events during the rebuild.
Action	Trigger Rebuild on `TelehealthComposite`.
Expected observation	The rebuild has to catch up to a moving HWM rather than a fixed one. The Projection Detail page's gap (events behind) should stay finite: it starts near the full HWM when the sequence resets to 0, may grow while the publisher outpaces the rebuild, then shrinks once the rebuild rate exceeds the publish rate. On the sparkline, a rebuild that is catching up shows the sequence line climbing faster than the HWM line. State badge stays `Running` throughout.
How to verify	Gap value over the run: starts at roughly the full HWM (HWM − 0), may climb while the publisher outpaces the rebuild, peaks, then shrinks to a steady-state lag that depends on publisher rate vs rebuild rate. Both lines (sequence + HWM) should be visible on the sparkline. SQL: poll both `last_seq_id` on the projection row and `last_seq_id` on the HighWaterMark row; both advance, and the projection eventually catches up.
Why it matters	A real production rebuild rarely happens in a quiet system. The UI has to keep the operator oriented — am I making net progress or falling further behind? — even with both numbers moving.
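
The net-progress question comes down to whether the gap between the HWM and the projection sequence is shrinking across samples. This is a sketch: `gap_trend` and its trend words are hypothetical, and the input is assumed to be `projection_seq|hwm_seq` pairs, oldest first, assembled from the two polled rows.

```bash
# Hypothetical helper: reads "projection_seq|hwm_seq" samples on stdin (oldest first),
# prints the gap (events behind) for each sample, then a one-word trend
# comparing the last gap with the first.
gap_trend() {
  awk -F'|' '
    { gap = $2 - $1; print gap }
    NR == 1 { first = gap }
    END {
      if (NR == 0)          print "no-samples"
      else if (gap < first) print "closing"
      else if (gap > first) print "widening"
      else                  print "flat"
    }
  '
}
```

A run that ends on `closing` is making net progress against the publisher. A sustained `widening` over a long window suggests the publish rate exceeds the rebuild rate on that rig, which is worth recording alongside the result.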

Open items / not-yet-implemented

The following are expected scenarios per #314 but require features that don't fully exist yet:

Cancellable rebuild from the UI: there is no Cancel action on the Rebuild button today. Tracked under #309. Once added, an LR-9 test step will cover it.
Progress % / ETA display during rebuild: the sparkline gives shape but no explicit % progress or ETA. Tracked under #309. Once added, LR-10 will cover it.
Per-tenant rebuild visualization: the per-tenant view selector exists (Phase 3a / #209), but the per-tenant rebuild progress rendering still needs design. Cross-reference #309.

Add new test steps to this file as those features land; reference the PR + issue they came from.

Cross-reference

Design partner: #309 — Long-running projection rebuild orchestration
Sample family substrate: src/Samples/TeleHealth/ (added in #317)
Pre-existing automated coverage: Tests.Integration.projection_commands, Tests.Integration.MultiTenancy.PerTenantDaemonRebuildTests (parked)
HWM frozen signal: #150 signal 1
PS#3 stop-on-error rendering: #326
