Alert System
The CritterWatch alert system evaluates incoming telemetry from monitored services against configurable thresholds and manages a fully event-sourced alert lifecycle.
Alert Evaluators
Alert evaluation is performed by IAlertEvaluator implementations that run as part of the Wolverine message handler pipeline. Each evaluator is responsible for a specific category of alert:
| Evaluator | Trigger |
|---|---|
| DeadLetterAlertEvaluator | DLQ count thresholds per service/message type |
| ProjectionLagAlertEvaluator | Projection lag and stall detection |
| AgentHealthAlertEvaluator | Consecutive unhealthy agent reports |
| CircuitBreakerAlertEvaluator | Circuit breaker state transitions |
| BackPressureAlertEvaluator | Back pressure activation |
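To make the one-evaluator-per-category idea concrete, here is a minimal sketch. The real IAlertEvaluator signature and telemetry types are not shown in this document, so every type below (TelemetrySketch, AlertCandidateSketch, IAlertEvaluatorSketch, and the runner) is a hypothetical stand-in, and the Wolverine pipeline integration is left out entirely.

```csharp
using System.Collections.Generic;

// Hypothetical telemetry shape, invented for this sketch.
public record TelemetrySketch(string ServiceName, bool BackPressureActive, int DeadLetterCount);

// Hypothetical evaluator output: which category fired, and for which service.
public record AlertCandidateSketch(string ServiceName, string Category);

// Assumed contract; the real IAlertEvaluator signature may differ.
public interface IAlertEvaluatorSketch
{
    IEnumerable<AlertCandidateSketch> Evaluate(TelemetrySketch telemetry);
}

// Fires whenever the telemetry reports active back pressure.
public class BackPressureEvaluatorSketch : IAlertEvaluatorSketch
{
    public IEnumerable<AlertCandidateSketch> Evaluate(TelemetrySketch telemetry)
    {
        if (telemetry.BackPressureActive)
            yield return new AlertCandidateSketch(telemetry.ServiceName, "BackPressure");
    }
}

// Fires when the dead letter count reaches the configured warning count.
public class DeadLetterEvaluatorSketch : IAlertEvaluatorSketch
{
    private readonly int _warningCount;

    public DeadLetterEvaluatorSketch(int warningCount) => _warningCount = warningCount;

    public IEnumerable<AlertCandidateSketch> Evaluate(TelemetrySketch telemetry)
    {
        if (telemetry.DeadLetterCount >= _warningCount)
            yield return new AlertCandidateSketch(telemetry.ServiceName, "DeadLetter");
    }
}

public static class EvaluatorRunnerSketch
{
    // Runs every evaluator against one telemetry sample and collects the results.
    public static List<AlertCandidateSketch> RunAll(
        IEnumerable<IAlertEvaluatorSketch> evaluators, TelemetrySketch telemetry)
    {
        var results = new List<AlertCandidateSketch>();
        foreach (var evaluator in evaluators)
            results.AddRange(evaluator.Evaluate(telemetry));
        return results;
    }
}
```

Keeping each category in its own class means a new alert type can be added without touching the existing evaluators.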
Alert Records
Alerts are stored as Marten snapshot documents (AlertRecord). Each record tracks:

    public class AlertRecord
    {
        public Guid Id { get; set; }
        public string ServiceName { get; set; }
        public AlertType Type { get; set; }
        public AlertSeverity Severity { get; set; }
        public AlertStatus Status { get; set; }
        public string Subject { get; set; }
        public string Description { get; set; }
        public string? Details { get; set; }
        public DateTimeOffset RaisedAt { get; set; }
        public DateTimeOffset? ResolvedAt { get; set; }
        public DateTimeOffset? ClearedAt { get; set; }
        public int ConsecutiveCount { get; set; }
        public AlertTransition[] Transitions { get; set; }
    }

Alert Lifecycle Events
The alert system publishes domain events for each transition. These events are handled by the timeline projection and relayed to the browser via SignalR:
- AlertRaised: first threshold breach
- AlertElevated: condition persists beyond escalation period
- AlertReduced: condition improving but not resolved
- AlertResolved: condition cleared automatically
- AlertAcknowledged: operator acknowledged
- AlertSnoozed: operator snoozed for a duration
- AlertCleared: operator explicitly cleared with optional note
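The event list splits into system-driven transitions and operator actions. The sketch below folds an event history to decide whether an alert is still open. The enum and helper names are hypothetical, and it assumes that AlertResolved and AlertCleared close an alert while AlertAcknowledged and AlertSnoozed leave it open; the actual AlertStatus values are not shown in this document.

```csharp
using System.Collections.Generic;

// Mirrors the event names listed above; the enum itself is a stand-in for this sketch.
public enum LifecycleEventSketch
{
    AlertRaised, AlertElevated, AlertReduced, AlertResolved,
    AlertAcknowledged, AlertSnoozed, AlertCleared
}

public static class AlertLifecycleSketch
{
    // The three operator-initiated events, per the list above.
    public static bool IsOperatorAction(LifecycleEventSketch e) =>
        e is LifecycleEventSketch.AlertAcknowledged
          or LifecycleEventSketch.AlertSnoozed
          or LifecycleEventSketch.AlertCleared;

    // Replays a history in order. Assumption: Resolved and Cleared close the alert,
    // Acknowledged and Snoozed do not change whether it is open.
    public static bool IsOpen(IEnumerable<LifecycleEventSketch> history)
    {
        var open = false;
        foreach (var e in history)
        {
            switch (e)
            {
                case LifecycleEventSketch.AlertRaised:
                case LifecycleEventSketch.AlertElevated:
                case LifecycleEventSketch.AlertReduced:
                    open = true;
                    break;
                case LifecycleEventSketch.AlertResolved:
                case LifecycleEventSketch.AlertCleared:
                    open = false;
                    break;
            }
        }
        return open;
    }
}
```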
Threshold Hierarchy
Thresholds are resolved in the following order of priority, most specific first:
- Message-type specific: alerts.ForMessageType("BookTrip", ...)
- Service specific: alerts.ForService("trip-service", ...)
- Global defaults: alerts.GlobalDefaults(...)
The most specific matching threshold wins.
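The lookup order can be sketched as a chain of dictionary checks. This is an illustration only: ThresholdResolverSketch, DlqThresholdSketch, and the two Set... methods are hypothetical names, not the CritterWatch configuration API, and the sketch assumes message-type overrides are keyed by message type name alone.

```csharp
using System.Collections.Generic;

// Hypothetical threshold pair for dead letter queue counts.
public record DlqThresholdSketch(int WarningCount, int CriticalCount);

public class ThresholdResolverSketch
{
    private readonly DlqThresholdSketch _globalDefaults;
    private readonly Dictionary<string, DlqThresholdSketch> _byService = new();
    private readonly Dictionary<string, DlqThresholdSketch> _byMessageType = new();

    public ThresholdResolverSketch(DlqThresholdSketch globalDefaults) =>
        _globalDefaults = globalDefaults;

    public void SetServiceOverride(string service, DlqThresholdSketch threshold) =>
        _byService[service] = threshold;

    public void SetMessageTypeOverride(string messageType, DlqThresholdSketch threshold) =>
        _byMessageType[messageType] = threshold;

    // Most specific match wins: message type, then service, then global defaults.
    public DlqThresholdSketch Resolve(string service, string messageType)
    {
        if (_byMessageType.TryGetValue(messageType, out var threshold)) return threshold;
        if (_byService.TryGetValue(service, out threshold)) return threshold;
        return _globalDefaults;
    }
}
```

With a global default of 10/100, a service override of 20/200 for "trip-service", and a message-type override of 5/50 for "BookTrip", resolving "BookTrip" on "trip-service" yields 5/50, while any other message type on that service yields 20/200.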
AgentHealthAlertEvaluator
The agent health evaluator is the most complex. It tracks the number of consecutive unhealthy checks per agent and escalates the alert as that streak grows:

    Healthy → (N consecutive unhealthy) → Warning raised
    Warning → (N more consecutive unhealthy) → Elevated to Critical
    Critical → (healthy reported) → Resolved automatically

The counts N for warning and critical are configured via AgentUnhealthyWarningCount and AgentUnhealthyCriticalCount.
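The escalation flow can be sketched as a small state machine. All names below are hypothetical. Two assumptions are baked in: the critical count is read as additional unhealthy reports after the warning (following the "N more" wording in the diagram), and a healthy report resolves an alert at either level. If the critical count is actually a total streak length, only the second comparison changes.

```csharp
public enum AgentLevelSketch { Healthy, Warning, Critical }

public enum AgentEventSketch { None, Raised, Elevated, Resolved }

public class AgentHealthTrackerSketch
{
    private readonly int _warningCount;
    private readonly int _criticalCount;
    private int _consecutiveUnhealthy;

    public AgentLevelSketch Level { get; private set; } = AgentLevelSketch.Healthy;

    public AgentHealthTrackerSketch(int warningCount, int criticalCount)
    {
        _warningCount = warningCount;
        _criticalCount = criticalCount;
    }

    // Feeds one health report in and returns the lifecycle event it triggers, if any.
    public AgentEventSketch Report(bool healthy)
    {
        if (healthy)
        {
            // Any healthy report breaks the streak and resolves an active alert.
            _consecutiveUnhealthy = 0;
            if (Level == AgentLevelSketch.Healthy) return AgentEventSketch.None;
            Level = AgentLevelSketch.Healthy;
            return AgentEventSketch.Resolved;
        }

        _consecutiveUnhealthy++;

        if (Level == AgentLevelSketch.Healthy && _consecutiveUnhealthy >= _warningCount)
        {
            Level = AgentLevelSketch.Warning;
            return AgentEventSketch.Raised;
        }

        // Assumption: critical needs _criticalCount further reports after the warning.
        if (Level == AgentLevelSketch.Warning &&
            _consecutiveUnhealthy >= _warningCount + _criticalCount)
        {
            Level = AgentLevelSketch.Critical;
            return AgentEventSketch.Elevated;
        }

        return AgentEventSketch.None;
    }
}
```

Under these assumptions, a tracker built with counts 2 and 3 raises a warning on the second consecutive unhealthy report, elevates on the fifth, and resolves on the next healthy one.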
Auto-Resolution
System-condition alerts (DLQ counts, projection lag, circuit breakers, back pressure, agent health) auto-resolve when the condition clears. The evaluator detects the clearing condition and publishes an AlertResolved event.
Operational alerts (node ejection, manual DLQ operations) are informational — they transition directly from Raised to Cleared after a configured TTL.
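The two closing paths can be sketched as a pair of predicates. The names are hypothetical, and the sketch assumes that a condition "clears" when the measured value drops back below the warning threshold, and that the TTL is measured from the time the alert was raised; the document does not spell out either detail.

```csharp
using System;

public static class AutoResolutionSketch
{
    // System-condition alerts: resolve an active alert once the measured value
    // is back under the warning threshold (assumed meaning of "clears").
    public static bool ShouldResolve(bool alertActive, int currentValue, int warningThreshold) =>
        alertActive && currentValue < warningThreshold;

    // Operational alerts: move from Raised to Cleared once the TTL has elapsed
    // (assumed to be measured from the raise time).
    public static bool ShouldClearOperational(DateTimeOffset raisedAt, TimeSpan ttl, DateTimeOffset now) =>
        now - raisedAt >= ttl;
}
```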
Alert Handler
The AlertCommandHandler processes operator commands:

    public static class AlertCommandHandler
    {
        // Acknowledge: records which operator acknowledged the alert.
        public static AlertAcknowledgedMessage Handle(AcknowledgeAlert cmd, AlertRecord alert)
            => new AlertAcknowledgedMessage(alert.Id, cmd.AcknowledgedBy);

        // Snooze: records the snooze duration and the operator who requested it.
        public static AlertSnoozedMessage Handle(SnoozeAlert cmd, AlertRecord alert)
            => new AlertSnoozedMessage(alert.Id, cmd.SnoozeDuration, cmd.SnoozedBy);

        // Clear: records the optional note and the operator who cleared the alert.
        public static AlertClearedMessage Handle(ClearAlert cmd, AlertRecord alert)
            => new AlertClearedMessage(alert.Id, cmd.Note, cmd.ClearedBy);
    }

Each command result is:
- Applied to the AlertRecord document in Marten
- Appended to the alert's event history
- Logged to the audit trail
Configuration
Full threshold configuration options:

    // Alert thresholds cascade: message-type specific → service specific → global defaults.
    //
    // Global defaults:
    //   DeadLetterQueueWarningCount: 10
    //   DeadLetterQueueCriticalCount: 100
    //   ProjectionLagWarningSeconds: 30
    //   ProjectionLagCriticalSeconds: 300
    //   AgentUnhealthyWarningCount: 2
    //   AgentUnhealthyCriticalCount: 5
    //
    // Service-level overrides apply to all alerts for a specific service.
    // Per-message-type overrides are the most specific level.
    //
    // Configure thresholds in Settings > Alert Configuration in the CritterWatch UI.
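As an illustration of how the default values might translate into a severity, the sketch below carries the global defaults in a record and classifies a measured value against a warning/critical pair. The record, enum, and method names are hypothetical, and the inclusive comparison (a value equal to the threshold counts as a breach) is an assumption.

```csharp
// Stand-in severity levels for this sketch; the real AlertSeverity values are not shown here.
public enum SeveritySketch { None, Warning, Critical }

// Default values are taken from the global defaults listed above.
public record AlertDefaultsSketch(
    int DeadLetterQueueWarningCount = 10,
    int DeadLetterQueueCriticalCount = 100,
    int ProjectionLagWarningSeconds = 30,
    int ProjectionLagCriticalSeconds = 300,
    int AgentUnhealthyWarningCount = 2,
    int AgentUnhealthyCriticalCount = 5);

public static class SeverityClassifierSketch
{
    // Assumption: thresholds are inclusive, and critical is checked before warning.
    public static SeveritySketch Classify(int value, int warning, int critical) =>
        value >= critical ? SeveritySketch.Critical
        : value >= warning ? SeveritySketch.Warning
        : SeveritySketch.None;
}
```

Under these assumptions, a dead letter count of 9 produces no alert, 10 produces a warning, and 100 produces a critical alert; a projection lag of 45 seconds falls in the warning band.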