Alert System

The CritterWatch alert system evaluates incoming telemetry from monitored services against configurable thresholds and manages a fully event-sourced alert lifecycle.

Alert Evaluators

Alert evaluation is performed by IAlertEvaluator implementations that run as part of the Wolverine message handler pipeline. Each evaluator is responsible for a specific category of alert:

| Evaluator | Trigger |
| --- | --- |
| DeadLetterAlertEvaluator | DLQ count thresholds per service/message type |
| ProjectionLagAlertEvaluator | Projection lag and stall detection |
| AgentHealthAlertEvaluator | Consecutive unhealthy agent reports |
| CircuitBreakerAlertEvaluator | Circuit breaker state transitions |
| BackPressureAlertEvaluator | Back pressure activation |
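
As a rough illustration of what a count-based check such as the dead letter evaluator's might reduce to, the sketch below maps a DLQ count to a severity. The IAlertEvaluator contract is not shown on this page, so the sketch does not implement it. DeadLetterSeverity and DeadLetterThresholdSketch are hypothetical names, the "at or above" comparison is an assumption, and the default values 10 and 100 are the global defaults listed under Configuration.

```csharp
using System;

// Hypothetical types for illustration only; not part of the CritterWatch API.
public enum DeadLetterSeverity { None, Warning, Critical }

public static class DeadLetterThresholdSketch
{
    // Maps a dead letter queue count to a severity.
    // Assumption: a count at or above a threshold counts as a breach.
    public static DeadLetterSeverity Classify(
        int deadLetterCount,
        int warningCount = 10,
        int criticalCount = 100)
    {
        if (warningCount > criticalCount)
            throw new ArgumentException("warningCount must not exceed criticalCount");

        if (deadLetterCount >= criticalCount) return DeadLetterSeverity.Critical;
        if (deadLetterCount >= warningCount) return DeadLetterSeverity.Warning;
        return DeadLetterSeverity.None;
    }
}
```

With the default thresholds, a count of 9 yields no alert, 10 yields a warning, and 100 yields a critical alert.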

Alert Records

Alerts are stored as Marten snapshot documents (AlertRecord). Each record tracks:

```csharp
public class AlertRecord
{
    public Guid Id { get; set; }
    public string ServiceName { get; set; }
    public AlertType Type { get; set; }
    public AlertSeverity Severity { get; set; }
    public AlertStatus Status { get; set; }
    public string Subject { get; set; }
    public string Description { get; set; }
    public string? Details { get; set; }
    public DateTimeOffset RaisedAt { get; set; }
    public DateTimeOffset? ResolvedAt { get; set; }
    public DateTimeOffset? ClearedAt { get; set; }
    public int ConsecutiveCount { get; set; }
    public AlertTransition[] Transitions { get; set; }
}
```

Alert Lifecycle Events

The alert system publishes domain events for each transition. These events are handled by the timeline projection and relayed to the browser via SignalR:

  • AlertRaised — first threshold breach
  • AlertElevated — condition persists beyond escalation period
  • AlertReduced — condition improving but not resolved
  • AlertResolved — condition cleared automatically
  • AlertAcknowledged — operator acknowledged
  • AlertSnoozed — operator snoozed for a duration
  • AlertCleared — operator explicitly cleared with optional note

Threshold Hierarchy

Thresholds are resolved with the following priority:

  1. Message-type specific: alerts.ForMessageType("BookTrip", ...)
  2. Service specific: alerts.ForService("trip-service", ...)
  3. Global defaults: alerts.GlobalDefaults(...)

The most specific matching threshold wins.
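
The lookup order above can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not the CritterWatch implementation: DlqThresholds and ThresholdResolverSketch are hypothetical names, only the ForService and ForMessageType method names echo the calls listed above, and the sketch assumes message-type overrides are keyed by message type name alone.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical threshold shape; the field names echo the documented DLQ defaults.
public record DlqThresholds(int WarningCount, int CriticalCount);

public class ThresholdResolverSketch
{
    private readonly DlqThresholds _global;
    private readonly Dictionary<string, DlqThresholds> _byService = new();
    private readonly Dictionary<string, DlqThresholds> _byMessageType = new();

    public ThresholdResolverSketch(DlqThresholds globalDefaults) => _global = globalDefaults;

    public void ForService(string serviceName, DlqThresholds thresholds)
        => _byService[serviceName] = thresholds;

    // Assumption: overrides are keyed by message type name only.
    public void ForMessageType(string messageType, DlqThresholds thresholds)
        => _byMessageType[messageType] = thresholds;

    // Most specific match wins: message type, then service, then global defaults.
    public DlqThresholds Resolve(string serviceName, string? messageType)
    {
        if (messageType is not null && _byMessageType.TryGetValue(messageType, out var byType))
            return byType;
        if (_byService.TryGetValue(serviceName, out var byService))
            return byService;
        return _global;
    }
}
```

A service with no overrides falls through to the global defaults, while a message type with its own override ignores the service-level setting.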

AgentHealthAlertEvaluator

The agent health evaluator is the most complex. It tracks consecutive unhealthy check counts per agent and escalates alerts as that count grows:

Healthy → (N consecutive unhealthy) → Warning raised
Warning → (N more consecutive unhealthy) → Elevated to Critical
Critical → (healthy reported) → Resolved automatically

The value of N at each step is configured via AgentUnhealthyWarningCount (warning) and AgentUnhealthyCriticalCount (critical).
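
The escalation above can be sketched as a per-agent counter. This is a hedged illustration rather than the evaluator's actual code: AgentAlertLevel and AgentHealthTrackerSketch are hypothetical names, and the sketch assumes both thresholds are totals of consecutive unhealthy reports (warning at 2, critical at 5 with the documented defaults) and that any healthy report resets the count. If the critical count is instead meant as additional reports after the warning, the comparison would need adjusting.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only.
public enum AgentAlertLevel { None, Warning, Critical }

public class AgentHealthTrackerSketch
{
    private readonly int _warningCount;
    private readonly int _criticalCount;
    private readonly Dictionary<string, int> _consecutiveUnhealthy = new();

    public AgentHealthTrackerSketch(int warningCount = 2, int criticalCount = 5)
    {
        if (warningCount < 1 || criticalCount < warningCount)
            throw new ArgumentException("Expected 1 <= warningCount <= criticalCount");
        _warningCount = warningCount;
        _criticalCount = criticalCount;
    }

    // Records one health report and returns the agent's resulting alert level.
    public AgentAlertLevel Report(string agentId, bool healthy)
    {
        if (healthy)
        {
            // Assumption: a single healthy report resolves any active alert.
            _consecutiveUnhealthy.Remove(agentId);
            return AgentAlertLevel.None;
        }

        // 'previous' is 0 when the agent has no unhealthy streak yet.
        _consecutiveUnhealthy.TryGetValue(agentId, out var previous);
        var count = previous + 1;
        _consecutiveUnhealthy[agentId] = count;

        if (count >= _criticalCount) return AgentAlertLevel.Critical;
        if (count >= _warningCount) return AgentAlertLevel.Warning;
        return AgentAlertLevel.None;
    }
}
```

Counts are kept per agent, so one flapping agent does not push another toward a warning.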

Auto-Resolution

System-condition alerts (DLQ counts, projection lag, circuit breakers, back pressure, agent health) auto-resolve when the condition clears. The evaluator detects the clearing condition and publishes an AlertResolved event.

Operational alerts (node ejection, manual DLQ operations) are informational — they transition directly from Raised to Cleared after a configured TTL.

Alert Handler

The AlertCommandHandler processes operator commands:

```csharp
// Acknowledge
public static AlertAcknowledgedMessage Handle(AcknowledgeAlert cmd, AlertRecord alert)
    => new AlertAcknowledgedMessage(alert.Id, cmd.AcknowledgedBy);

// Snooze
public static AlertSnoozedMessage Handle(SnoozeAlert cmd, AlertRecord alert)
    => new AlertSnoozedMessage(alert.Id, cmd.SnoozeDuration, cmd.SnoozedBy);

// Clear
public static AlertClearedMessage Handle(ClearAlert cmd, AlertRecord alert)
    => new AlertClearedMessage(alert.Id, cmd.Note, cmd.ClearedBy);
```

Each command result is:

  1. Applied to the AlertRecord document in Marten
  2. Appended to the alert's event history
  3. Logged to the audit trail

Configuration

Full threshold configuration options:

// Alert thresholds cascade: message-type specific → service specific → global defaults.
//
// Global defaults:
//   DeadLetterQueueWarningCount: 10
//   DeadLetterQueueCriticalCount: 100
//   ProjectionLagWarningSeconds: 30
//   ProjectionLagCriticalSeconds: 300
//   AgentUnhealthyWarningCount: 2
//   AgentUnhealthyCriticalCount: 5
//
// Service-level overrides apply to all alerts for a specific service.
// Per-message-type overrides are the most specific level.
//
// Configure thresholds in Settings > Alert Configuration in the CritterWatch UI.

Released under the MIT License.