
Manual Test Plan — Long-running projection rebuilds (priority)

The priority plan for #314. Exercises projection rebuilds at durations of minutes to hours under realistic publisher traffic, with operator-visible failure-mode tests for the mid-flight cases CritterWatch users actually hit in production.

Design partner: #309 — Long-running projection rebuild orchestration. Cross-references inline below.

Substrate

TeleHealth is the long-running substrate (Marten.ScaleTesting-derived).

```bash
# Default scale: fast iteration (~30s rebuild)
dotnet run --project src/BffHost  # Full scenario

# Scale is tuned via env vars on the TeleHealthPublisher resource (set before boot).
# Default: 5 tenants × 50k events = 250k events.
export TELEHEALTH_TENANTS=5
export TELEHEALTH_EVENTS_PER_TENANT=50000

# Upstream max (matches Marten.ScaleTesting): 50 tenants × 400k events = 20M events.
export TELEHEALTH_TENANTS=50
export TELEHEALTH_EVENTS_PER_TENANT=400000
```

TeleHealth.Publisher seeds the events once and then stays alive but idle, so the rebuild runs against a stable event store with no concurrent writes, which isolates rebuild duration from publisher pressure. Re-running the test with the publisher active (--continuous flag; see TeleHealth.Publisher/EventSeeder.cs) is its own scenario, called out where relevant below.
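The scale figures quoted for the default and upstream-max runs are just the product of the two seeding variables. A minimal sketch for checking what a given configuration will seed follows; the function name `telehealth_total_events` is hypothetical and not part of the repo, and only the two environment variable names and their default values come from this plan.

```bash
# Hypothetical helper (not part of the repo): multiplies the two seeding
# knobs to get the number of events a full rebuild has to replay.
# Falls back to the env vars, then to the default scale from this plan.
telehealth_total_events() {
  tenants="${1:-${TELEHEALTH_TENANTS:-5}}"
  per_tenant="${2:-${TELEHEALTH_EVENTS_PER_TENANT:-50000}}"
  echo $(( tenants * per_tenant ))
}

telehealth_total_events 5 50000     # prints 250000   (default scale)
telehealth_total_events 50 400000   # prints 20000000 (upstream max)
```

Called with no arguments, the helper reads whatever is currently exported, which makes it a quick sanity check before booting a multi-hour run.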

TelehealthComposite is the projection under test: a 4 + 2 + 2 multi-stream composite that mirrors Marten.ScaleTesting's Stage-2 + Stage-3 split, so a rebuild touches every event in the store.

LR-1 — Baseline rebuild at default scale

| Field | Value |
| --- | --- |
| Setup | Full scenario at default scale (5 tenants × 50k events). Wait for TeleHealthPublisher to finish seeding (its BackgroundService logs EventSeeder: seeded N events). Confirm TeleHealth.Service shows TelehealthComposite at Updated with the full 250k HWM. |
| Action | Trigger Rebuild on TelehealthComposite from the Projection Detail page. |
| Expected observation | The shard's sequence drops to 0 and starts climbing. The state badge cycles Updated → Stopped → Running, then stays Running for the duration. Sparkline shows a smooth monotonic climb. Rebuild completes back at the seeded HWM in roughly 30–90 seconds at default scale on a dev rig. Heartbeat stays green throughout. |
| How to verify | UI: state badge transitions. SQL: `SELECT last_seq_id, agent_status, last_heartbeat FROM telehealth.mt_event_progression WHERE name LIKE 'TelehealthComposite:%';` Expect last_heartbeat to advance every poll interval (~5s), last_seq_id to climb monotonically, and agent_status to stay Running (or transition through expected rebuild states). API: `service.shardStates[<shardName>].lastHeartbeat` advances; `.sequence` climbs. |
| Why it matters | Baseline duration plus heartbeat liveness across a full rebuild is the operator's "is it making progress" signal. The HWM never freezes if the rebuild is healthy; if it does, the HWM Frozen alert from #150 signal 1 fires after 30s. |
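The "last_seq_id climbs monotonically" check is easy to misjudge by eye when spot-checking SQL output. A minimal sketch of an automated check follows, assuming the tester collects last_seq_id values by re-running the query above once per poll (for example with psql; connection details are left out). The helper name `is_monotonic` is hypothetical.

```bash
# Hypothetical helper: succeeds (exit 0) when every sample is >= the previous
# one, i.e. last_seq_id never moved backwards between polls.
# Start sampling only after the rebuild has reset the sequence to 0,
# otherwise the expected drop from the seeded HWM is flagged as a regression.
is_monotonic() {
  prev=""
  for seq in "$@"; do
    if [ -n "$prev" ] && [ "$seq" -lt "$prev" ]; then
      echo "regression: $prev -> $seq" >&2
      return 1
    fi
    prev="$seq"
  done
  return 0
}
```

For example, `is_monotonic 0 41200 41200 97350` succeeds (a repeated value is tolerated, since two polls can land inside one batch), while `is_monotonic 0 500 300` fails and reports the regression on stderr. The sample values are made up for illustration.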

LR-2 — Rebuild at upstream max scale (50 × 400k = 20M events)

| Field | Value |
| --- | --- |
| Setup | Tune env vars before boot: TELEHEALTH_TENANTS=50, TELEHEALTH_EVENTS_PER_TENANT=400000. Allow seed to complete (~10–20 min depending on disk). |
| Action | Trigger Rebuild on TelehealthComposite. Capture the wall-clock start time. |
| Expected observation | Same shape as LR-1 but lasting minutes to tens of minutes. The Projection Detail page's sparkline must remain responsive while the rebuild runs; the Last Advanced timestamp updates every poll cycle. The state badge stays Running (never silently flips to Stale). |
| How to verify | Compare wall-clock time against the UI's Last Advanced timestamp: the delta should stay below the polling interval at all times. SQL spot-check during the run: `SELECT last_seq_id, last_heartbeat, EXTRACT(EPOCH FROM (now() - last_heartbeat)) AS seconds_since_heartbeat FROM telehealth.mt_event_progression WHERE name LIKE 'TelehealthComposite:%';` Expect seconds_since_heartbeat to stay below the poll interval throughout. |
| #309 cross-ref | The orchestration design discussion in #309 calls out the operator visibility gap during long rebuilds: % progress, ETA, cancellable from the UI. Until #309 lands those affordances, this test verifies the current minimum: liveness signal and sequence advancement are observable. |
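Until the UI shows an ETA, a tester running the 20M-event rebuild can estimate one by hand from two SQL spot-checks. The sketch below is a back-of-envelope linear extrapolation, not the #309 design; the helper name `rebuild_eta_seconds` and its argument order are hypothetical.

```bash
# Hypothetical helper: linear ETA from two spot-checks of last_seq_id.
# Args: seq1 t1 seq2 t2 target
#   seq1/seq2 = last_seq_id at the two checks
#   t1/t2     = time of each check in seconds (e.g. from `date +%s`)
#   target    = seeded HWM the rebuild has to reach
rebuild_eta_seconds() {
  dseq=$(( $3 - $1 ))
  dt=$(( $4 - $2 ))
  if [ "$dseq" -le 0 ] || [ "$dt" -le 0 ]; then
    echo "no progress between samples" >&2
    return 1
  fi
  remaining=$(( $5 - $3 ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  # remaining / (dseq / dt), reordered to stay in integer arithmetic
  echo $(( remaining * dt / dseq ))
}

# 1M events replayed in the last 60s, 18M still to go:
rebuild_eta_seconds 1000000 0 2000000 60 20000000   # prints 1080
```

The estimate assumes a constant replay rate, which the plan does not guarantee, so treat it as a rough bound for deciding how long to keep watching rather than as a pass/fail criterion.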

LR-3 — Mid-flight pause-then-restart on a long rebuild

| Field | Value |
| --- | --- |
| Setup | Default scale; trigger Rebuild on TelehealthComposite per LR-1. Wait until the projection sequence is ~30–60% of seeded HWM (rough mid-flight). |
| Action | Click Pause on the Projection Detail page. Wait 5 seconds. Click Restart. |
| Expected observation | The state badge changes Rebuilding → Paused on the pause click within ~1s. The sequence freezes at the mid-flight value. After the Restart click the rebuild resumes from the paused sequence, not from 0 (no double-rebuild). State returns to Running. The sparkline shows a flat segment during the pause, then resumes the climb. |
| How to verify | SQL: `SELECT last_seq_id, agent_status FROM telehealth.mt_event_progression WHERE name LIKE 'TelehealthComposite:%';` At pause: agent_status = Paused, last_seq_id frozen. After restart: agent_status = Running, last_seq_id resumes climbing from the same point. The post-restart sequence should equal the pre-pause sequence (off by < 1 polling cycle's worth of events). |
| Why it matters | Pausing a long rebuild is a real operator move: they want to confirm sustainable load before letting it run to completion. Resume must not restart from scratch (which would silently double the rebuild time). |
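The resume criterion above has two failure modes: the sequence going backwards (the rebuild restarted from scratch) and the sequence jumping further than one polling cycle could explain. A minimal sketch of that comparison follows; the helper name `resumed_from_pause` is hypothetical, and the tolerance is whatever the tester judges to be one polling cycle's worth of events on their rig.

```bash
# Hypothetical helper. Args: pre_pause_seq post_restart_seq tolerance
# Succeeds when the first post-restart sample is at or just above the
# sequence recorded while paused.
resumed_from_pause() {
  pre="$1"; post="$2"; tol="$3"
  if [ "$post" -lt "$pre" ]; then
    echo "FAIL: sequence went backwards ($pre -> $post), rebuild restarted" >&2
    return 1
  fi
  if [ $(( post - pre )) -gt "$tol" ]; then
    echo "FAIL: jumped $(( post - pre )) events, more than $tol" >&2
    return 1
  fi
  echo "OK: resumed at $post (paused at $pre)"
}
```

For example, `resumed_from_pause 120000 121500 5000` prints `OK: resumed at 121500 (paused at 120000)`, while `resumed_from_pause 120000 300 5000` fails because the sequence dropped back towards 0. The numbers are illustrative only.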

LR-4 — Process kill mid-rebuild → daemon recovery + CritterWatch reconnect

| Field | Value |
| --- | --- |
| Setup | Default scale; trigger Rebuild per LR-1. Wait until ~30–60% mid-flight. |
| Action | Kill the TeleHealth.Service process from the Aspire dashboard (Stop) or via `pkill -f TeleHealth.Service`. Wait 10 seconds. Restart the service from the Aspire dashboard. |
| Expected observation | While the service is down: the CritterWatch UI's heartbeat dot for TeleHealthService goes amber within ~60s (one missed beat), then red after ~150s (five missed beats). The Projection Detail page surfaces the stale state: the last-advanced timestamp ages and the state pill turns warning-colored. On restart: heartbeat returns to green; the daemon resumes the in-progress rebuild from the persisted progression row (not from 0); CritterWatch picks up the in-progress shard state within about one polling cycle (~5s) of the daemon publishing it. |
| How to verify | SQL during the down window: last_seq_id is whatever it was at kill time (frozen). After restart: last_seq_id resumes climbing without resetting. The CritterWatch UI's Last Advanced timestamp re-anchors to the post-restart polling time. The pre-existing test Tests.Integration.heartbeats_via_signalr covers heartbeat liveness in isolation; this scenario adds the rebuild-state-preserved gate on top. |
| Why it matters | This is the production failure mode: a node bounces, an Aspire restart happens, a deploy lands. The rebuild needs to survive without operator-visible double-counting or silent regression. |
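When the outage is held longer than the 10 seconds in the action step, it helps to know which heartbeat colour to expect at a given moment. The sketch below simply encodes the approximate timeline stated above (amber within ~60s, red after ~150s); the thresholds are taken from this plan's wording, not read from CritterWatch configuration, and the helper name `expected_heartbeat_color` is hypothetical.

```bash
# Hypothetical helper: expected heartbeat dot colour for a given outage age.
# Arg: seconds since the service was killed.
expected_heartbeat_color() {
  age="$1"
  if [ "$age" -lt 60 ]; then
    echo "green-or-amber"   # amber may appear at any point inside the first ~60s
  elif [ "$age" -lt 150 ]; then
    echo "amber"
  else
    echo "red"
  fi
}
```

A tester can call it with a stopwatch reading, for instance `expected_heartbeat_color 90`, and compare the answer with what the UI shows. Because the figures in the plan are approximate, a mismatch within a few seconds of a threshold is not a failure.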

LR-5 — Force a projection apply error during rebuild (stop-on-error policy)

| Field | Value |
| --- | --- |
| Setup | Boot the Incidents sample (not TeleHealth; Incidents is the stop-on-error substrate landed in #316): `dotnet run --project src/Samples/Incidents/Incidents.Service` (or via Aspire). Wait for the publisher to drive a handful of incidents into the store. |
| Action | Inject a poison event by writing directly into the Incidents event store with a malformed payload (or via the Incidents publisher's --bad-payload flag if it exists; otherwise use the ChaosMonkey UI's Set Projection Failure Rate to 100% for the affected projection). Trigger Rebuild on IncidentsByCategory. |
| Expected observation | The shard halts on the first apply error and agent_status transitions to Stopped. The Projection Detail page surfaces: state badge → Stopped; the Apply Errors card shows the PS#3 stop-on-error rendering (no DLQ button, restart/rebuild guidance); the Errors card surfaces the actual ApplyException with stack trace and offending event sequence. No silent skip: the rebuild stops, full stop. |
| How to verify | UI: `data-testid="apply-error-policy-stop"` is visible (PS#3 selector). The Errors card lists the exception text. SQL: `SELECT name, agent_status, last_exception_message FROM incidents.mt_event_progression WHERE name LIKE 'IncidentsByCategory:%';` Expect agent_status = Stopped and last_exception_message populated. The PS#3 PR-A frontend test covers the rendering shape; this scenario gates the rebuild-time application of the policy. |
| Why it matters | Stop-on-error is the reporter's actual config in PS#3; a misleading "View Related Dead Letters" button would have sent them down a rabbit hole. The fix is in (PS#3 PR-A); this scenario keeps the regression gated. |

LR-6 — Cluster failover during a long rebuild

| Field | Value |
| --- | --- |
| Setup | Full scenario with two BffHost replicas (or two TeleHealth.Service replicas; the leader-elected projection coordinator picks one of them). Confirm both are Running in the Aspire dashboard and that the Projection Detail page identifies which node is currently running the projection (Running On Node column / shard state's RunningOnNode). Trigger Rebuild per LR-1. |
| Action | Mid-rebuild, kill the leader node (`pkill -f` the specific instance, the one whose ID matches RunningOnNode). |
| Expected observation | Within 1–2 polling cycles (15–30s) the projection coordinator on the surviving node picks up the work. The Projection Detail page surfaces the new RunningOnNode value. The rebuild continues from the persisted progression row, not from 0. CritterWatch's shard-state stream reflects both the brief Stopped gap during failover and the resumption on the new node. |
| How to verify | SQL during failover: the running_on_node column flips from the killed node's ID to the surviving one. last_seq_id continues climbing (modulo the failover gap). The pre-existing Tests.Integration.MultiTenancy.single_server_provision_on_demand_tests covers cluster behaviour broadly; this scenario adds the rebuild-state-preserved gate. |
| Why it matters | Two-node Aspire setups are the local-dev rehearsal for production HA. A leader bounce during rebuild must not restart the rebuild; that would silently double or triple the rebuild duration depending on how often it happens. |

LR-7 — Rebuild a single tenant's projection without disturbing others

| Field | Value |
| --- | --- |
| Setup | TeleHealth at default scale. Confirm all 5 tenants have an established sequence under TelehealthComposite. (TeleHealth uses conjoined multi-tenancy with 8-bucket hash partitioning; the shard surface is `TelehealthComposite:All:tenant_<n>` per tenant.) |
| Action | Trigger Rebuild scoped to tenant_0001 only (via the per-tenant rebuild affordance on the Projection Detail page: the Rebuild modal's tenant picker). |
| Expected observation | Only tenant_0001's shard resets its sequence to 0 and re-advances. The other 4 tenants' shards stay at their current sequence with no interruption. The Projections page's per-tenant view selector lets the operator confirm visually. |
| How to verify | SQL: `SELECT name, last_seq_id, agent_status FROM telehealth.mt_event_progression WHERE name LIKE 'TelehealthComposite:%';` Exactly one row's last_seq_id resets and climbs; the rest stay frozen at their pre-rebuild values. The pre-existing Tests.Integration.MultiTenancy.PerTenantDaemonRebuildTests covers this at the daemon layer (currently parked per the project's task #303; restore once unblocked); this scenario adds the UI gate. |
| Why it matters | Per-tenant rebuild is the only acceptable rebuild path for a 50-tenant store at scale; a full-store rebuild would take hours and disrupt every tenant. Reporter scenarios in PS-class issues lean on this; the affordance has to work end-to-end through the BFF UI. |
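The "exactly one row resets, the rest stay frozen" check amounts to diffing two snapshots of the progression table. A minimal sketch follows, assuming each snapshot is captured as plain text with one `name last_seq_id` pair per line (for example psql in unaligned, tuples-only mode with a space as field separator; the exact invocation and connection details are left to the tester). The helper name `changed_shards` and the sample values are hypothetical.

```bash
# Hypothetical helper: prints the shard names whose last_seq_id differs
# between two snapshots.
# Args: before_snapshot after_snapshot (each "name seq" per line)
changed_shards() {
  printf '%s\n---\n%s\n' "$1" "$2" | awk '
    $0 == "---" { second = 1; next }      # separator between the snapshots
    NF < 2      { next }                  # skip blank or malformed lines
    !second     { before[$1] = $2; next } # remember pre-rebuild sequence
    ($1 in before) && before[$1] != $2 { print $1 }
  '
}

# Illustrative snapshots, not real output:
before='TelehealthComposite:All:tenant_0001 250000
TelehealthComposite:All:tenant_0002 250000
TelehealthComposite:All:tenant_0003 250000'
after='TelehealthComposite:All:tenant_0001 1200
TelehealthComposite:All:tenant_0002 250000
TelehealthComposite:All:tenant_0003 250000'

changed_shards "$before" "$after"   # prints TelehealthComposite:All:tenant_0001
```

The scenario passes this check when the output is exactly the one shard that was rebuilt. This only holds with the publisher idle; with --continuous running, every shard advances and the diff is no longer meaningful.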

LR-8 — Rebuild with concurrent publisher activity

| Field | Value |
| --- | --- |
| Setup | TeleHealth at default scale, but start the publisher in --continuous mode (CLI flag on TeleHealth.Publisher) so it keeps appending events during the rebuild. |
| Action | Trigger Rebuild on TelehealthComposite. |
| Expected observation | The rebuild has to catch up to a moving HWM rather than a fixed one. The Projection Detail page's gap (events behind) should stay finite: it grows while the publisher outpaces the rebuild, then shrinks as the rebuild catches up after passing the publisher's typical lead. During the catch-up phase the sparkline shows the rebuild sequence climbing faster than the HWM line. State badge stays Running throughout. |
| How to verify | Gap value over the run: starts at ~HWM − 0 (the full HWM), climbs (gap grows), peaks, then shrinks to a steady-state lag that depends on publisher rate vs rebuild rate. Both lines (sequence and HWM) should be visible on the sparkline. SQL: poll both last_seq_id on the projection row and last_seq_id on the HighWaterMark row; both advance, and the projection eventually catches up. |
| Why it matters | A real production rebuild rarely happens in a quiet system. The UI has to keep the operator oriented ("am I making net progress or falling further behind?") even with both numbers moving. |
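With both numbers moving, the quantity to watch is the gap between the HighWaterMark row and the projection row. A minimal sketch for turning a series of polls into a grow/shrink trend follows, assuming the tester records one `sequence hwm` pair per poll; the helper name `gap_trend` and the sample figures are hypothetical.

```bash
# Hypothetical helper: reads "seq hwm" pairs on stdin (one poll per line) and
# prints the gap for each poll plus whether it grew or shrank since the
# previous poll.
gap_trend() {
  awk '
    NF >= 2 {
      gap = $2 - $1
      if (n++ == 0)        trend = "start"
      else if (gap > prev) trend = "growing"
      else if (gap < prev) trend = "shrinking"
      else                 trend = "flat"
      print gap, trend
      prev = gap
    }
  '
}

# Three illustrative polls: projection sequence, then HWM.
printf '0 250000\n40000 300000\n150000 320000\n' | gap_trend
# prints:
#   250000 start
#   260000 growing
#   170000 shrinking
```

A run that ends on a sustained "shrinking" or "flat" trend matches the expected observation; a gap that is still "growing" late in the run means the publisher rate exceeds the rebuild rate on that rig.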

Open items / not-yet-implemented

The following are expected scenarios per #314 but require features that don't fully exist yet:

  • Cancellable rebuild from the UI — there's no Cancel action on the Rebuild button today. Tracked under #309. Once added, an LR-9 test step covers it.
  • Progress % / ETA display during rebuild — the sparkline gives shape but no explicit % progress / ETA. Tracked under #309. Once added, LR-10 covers it.
  • Per-tenant rebuild visualization — the per-tenant view selector exists (Phase 3a / #209) but the per-tenant rebuild progress rendering needs design. Cross-reference #309.

Add new test steps to this file as those features land; reference the PR + issue they came from.

Cross-reference

Released under the MIT License.