A dead man switch (dead man's switch / deadman switch) monitoring service for Prometheus and Loki. Detects when expected signals — metrics or log lines — stop arriving. If a cron job silently stops running, a service stops emitting metrics, or a periodic log line disappears, Vigil catches it.
Traditional monitoring catches errors. Vigil catches silence.
Cron jobs and periodic processes can fail silently — they simply don't run, producing no logs and no errors. Log-based alerts and error alerts can't catch the absence of a signal. Vigil watches for expected signals and raises an alert when they go missing.
Use Vigil when:
- A cron job should run every hour but you'd only notice days later if it stopped
- A nightly report should complete between 2am-4am
- A data sync should happen at regular intervals but the schedule isn't fixed
- You want to auto-discover recurring patterns in your logs and alert if they disappear
Your Apps Vigil (:8080)
push logs ──────► Loki ◄──────── queries Loki (LogQL)
/metrics ──────► Prometheus ◄── queries Prometheus (PromQL)
│
▼
evaluates every 30s
updates dms_switch_status
│
┌─────────┴─────────┐
▼ ▼
Prometheus ◄── /metrics Built-in alerts
│ │
▼ ▼
Grafana alerts Slack / Discord /
Webhook / PagerDuty /
Telegram
Vigil supports two alerting paths:
- Built-in alerts — configure Slack, Discord, Webhook, PagerDuty, or Telegram channels directly in the Vigil UI. Alerts fire on state changes.
- Prometheus metrics — Vigil exposes
dms_switch_statusas a Prometheus gauge. Use your existing Grafana alerting for routing, silencing, and escalation.
Add this to your existing monitoring docker-compose.yml:
vigil:
image: shubhankarmohan/vigil:latest
ports:
- "8080:8080"
volumes:
- vigil-data:/data
- ./vigil.yml:/etc/vigil/vigil.yml:ro
depends_on:
- prometheus
- loki
volumes:
vigil-data:Add to your prometheus.yml:
scrape_configs:
# ... your existing scrape configs ...
- job_name: 'vigil'
static_configs:
- targets: ['vigil:8080']
scrape_interval: 15sOne alert rule covers all switches:
Query: dms_switch_status == 0
For: 0s (Vigil already applies grace periods)
Labels: name = {{ $labels.name }}, mode = {{ $labels.mode }}
Annotation: Switch {{ $labels.name }} is DOWN
Route this to your existing contact points (Slack, PagerDuty, email, etc.).
Go to http://localhost:8080 and create your first switch.
Vigil reads configuration from a YAML file. Copy the example and adjust for your environment:
cp vigil.yml.example vigil.ymlvigil.yml:
# Prometheus connection
prometheus_url: http://prometheus:9090
# prometheus_user: admin
# prometheus_password: secret
# Loki connection
loki_url: http://loki:3100
# loki_user: admin
# loki_password: secret
# Grafana (optional — for annotations on state changes)
# grafana_url: http://grafana:3000
# grafana_api_token: your-service-account-token
# Evaluation engine
eval_interval: 30s
# HTTP server
listen_addr: ":8080"
# SQLite database path
db_path: /data/vigil.dbConfig file is searched in order: CONFIG_FILE env var → ./vigil.yml → /etc/vigil/vigil.yml.
Environment variables can still override any YAML value:
| Variable | YAML key | Default | Description |
|---|---|---|---|
PROMETHEUS_URL |
prometheus_url |
http://prometheus:9090 |
Prometheus server URL |
PROMETHEUS_USER |
prometheus_user |
(empty) | Basic auth username for Prometheus |
PROMETHEUS_PASSWORD |
prometheus_password |
(empty) | Basic auth password for Prometheus |
LOKI_URL |
loki_url |
http://loki:3100 |
Loki server URL |
LOKI_USER |
loki_user |
(empty) | Basic auth username for Loki |
LOKI_PASSWORD |
loki_password |
(empty) | Basic auth password for Loki |
GRAFANA_URL |
grafana_url |
(empty) | Grafana URL (optional, for annotations) |
GRAFANA_API_TOKEN |
grafana_api_token |
(empty) | Grafana API token (optional) |
EVAL_INTERVAL |
eval_interval |
30s |
How often to evaluate all switches |
LISTEN_ADDR |
listen_addr |
:8080 |
HTTP server listen address |
DB_PATH |
db_path |
/data/vigil.db |
SQLite database file path |
If Loki is behind a reverse proxy (nginx), Vigil needs these endpoints exposed:
# Required
location /loki/api/v1/query { proxy_pass http://loki:3100; }
location /loki/api/v1/query_range { proxy_pass http://loki:3100; }
# Optional (for auto-discovery)
location /loki/api/v1/patterns { proxy_pass http://loki:3100; }For signals expected at a known interval. Configure:
- Interval: expected every N seconds (e.g., 3600 = every hour)
- Grace period: extra time before alerting (e.g., 300 = 5 min)
- Time window (optional): only monitor during specific hours (e.g., 09:00-17:00)
Prometheus example — watch a gauge that stores a unix timestamp:
Query: cron_last_run_timestamp{cron_name="sync_awb"}
Mode: frequency
Interval: 3600 (every 1 hour)
Grace: 300 (5 min grace)
Loki example — watch for a specific log line:
Query: {job="diagonAlleyBE_prod"} |= "[CRON] sync_awb completed"
Mode: frequency
Interval: 3600
Grace: 300
For signals that occur at irregular but roughly predictable intervals. Vigil learns the pattern from historical data and alerts when the signal is overdue.
- Min samples: how many data points to collect before activating (default: 4)
- Tolerance multiplier: how many times the median interval before alerting (default: 2x)
Query: {job="myapp"} |= "batch processing complete"
Mode: irregularity
Min samples: 4
Tolerance: 2.0
Vigil computes the median interval between occurrences and alerts if elapsed > tolerance * median.
NEW ──── first signal ──── UP
│
signal │ no signal within
arrives │ expected window
│ │
│ ▼
└──── GRACE
│
signal │ grace period
arrives │ expires
│ │
▼ ▼
UP DOWN ── signal arrives ── UP
LEARNING: Irregularity mode — collecting initial data points.
PAUSED: Manually paused. No evaluation.
| Metric | Type | Labels | Description |
|---|---|---|---|
dms_switch_status |
gauge | name, mode, signal | 1 = healthy, 0 = violated |
dms_last_signal_timestamp |
gauge | name | Unix timestamp of last signal |
dms_expected_at_timestamp |
gauge | name | Unix timestamp of next expected signal |
dms_state_duration_seconds |
gauge | name, state | Seconds in current state |
dms_eval_total |
counter | name, result | Evaluation count (pass/fail) |
This scans Loki every hour for patterns matching [CRON]* in the specified job, and auto-creates irregularity-mode switches for any recurring patterns found.
# Pull from Docker Hub
docker pull shubhankarmohan/vigil:latest
# Run standalone
docker run -d \
--name vigil \
-p 8080:8080 \
-v vigil-data:/data \
-v ./vigil.yml:/etc/vigil/vigil.yml:ro \
shubhankarmohan/vigil:latestTo build from source instead:
docker build -t vigil .# Prerequisites: Go 1.23+, Node 18+
# Run backend
DB_PATH=./vigil.db \
PROMETHEUS_URL=https://metrics.example.com \
PROMETHEUS_USER=admin \
PROMETHEUS_PASSWORD=secret \
LOKI_URL=https://logs.example.com \
LOKI_USER=admin \
LOKI_PASSWORD=secret \
LISTEN_ADDR=:8181 \
go run ./cmd/vigil
# Run frontend (separate terminal)
cd web
npm install
npm run dev
# Opens at http://localhost:5173, proxies API to :8181