Vigil — Dead Man's Switch for Monitoring

A dead man's switch monitoring service for Prometheus and Loki. It detects when expected signals (metrics or log lines) stop arriving. If a cron job silently stops running, a service stops emitting metrics, or a periodic log line disappears, Vigil catches it.

Traditional monitoring catches errors. Vigil catches silence.

Why

Cron jobs and periodic processes can fail silently — they simply don't run, producing no logs and no errors. Log-based alerts and error alerts can't catch the absence of a signal. Vigil watches for expected signals and raises an alert when they go missing.

Use Vigil when:

A cron job should run every hour but you'd only notice days later if it stopped
A nightly report should complete between 2am and 4am
A data sync should happen at regular intervals but the schedule isn't fixed
You want to auto-discover recurring patterns in your logs and alert if they disappear

How It Works

Your Apps                          Vigil (:8080)
  push logs ──────► Loki ◄──────── queries Loki (LogQL)
  /metrics  ──────► Prometheus ◄── queries Prometheus (PromQL)
                                      │
                                      ▼
                                   evaluates every 30s
                                   updates dms_switch_status
                                      │
                            ┌─────────┴─────────┐
                            ▼                   ▼
                    Prometheus ◄── /metrics   Built-in alerts
                            │                   │
                            ▼                   ▼
                    Grafana alerts       Slack / Discord /
                                        Webhook / PagerDuty /
                                        Telegram

Vigil supports two alerting paths:

Built-in alerts — configure Slack, Discord, Webhook, PagerDuty, or Telegram channels directly in the Vigil UI. Alerts fire on state changes.
Prometheus metrics — Vigil exposes dms_switch_status as a Prometheus gauge. Use your existing Grafana alerting for routing, silencing, and escalation.

Quick Start

1. Add Vigil to your docker-compose

Add this to your existing monitoring docker-compose.yml:

services:
  # ... your existing services ...

  vigil:
    image: shubhankarmohan/vigil:latest
    ports:
      - "8080:8080"
    volumes:
      - vigil-data:/data
      - ./vigil.yml:/etc/vigil/vigil.yml:ro
    depends_on:
      - prometheus
      - loki

volumes:
  vigil-data:

2. Add Vigil scrape target to Prometheus

Add to your prometheus.yml:

scrape_configs:
  # ... your existing scrape configs ...

  - job_name: 'vigil'
    static_configs:
      - targets: ['vigil:8080']
    scrape_interval: 15s

3. Create a Grafana alert rule

One alert rule covers all switches:

Query:        dms_switch_status == 0
For:          0s   (Vigil already applies grace periods)
Labels:       name = {{ $labels.name }}, mode = {{ $labels.mode }}
Annotation:   Switch {{ $labels.name }} is DOWN

Route this to your existing contact points (Slack, PagerDuty, email, etc.).
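
The same condition can also be checked outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The sketch below is only an illustration and not part of Vigil: the function names and the sample payload are hypothetical, and it assumes the `name` label documented for `dms_switch_status`.

```python
import json
import urllib.parse
import urllib.request


def down_switches(payload: dict) -> list:
    """Return switch names from an instant-query response for dms_switch_status == 0."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return sorted(
        sample["metric"].get("name", "<unnamed>")
        for sample in payload["data"]["result"]
    )


def fetch_down_switches(prometheus_url: str) -> list:
    """Run the instant query against Prometheus (needs network access)."""
    query = urllib.parse.urlencode({"query": "dms_switch_status == 0"})
    url = f"{prometheus_url.rstrip('/')}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return down_switches(json.load(resp))


# Hypothetical response body, shaped like a Prometheus instant-vector result.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"name": "nightly-report", "mode": "frequency"},
                "value": [1700000000, "0"],
            },
        ],
    },
}

print(down_switches(SAMPLE))
```

Keeping the parsing separate from the HTTP call makes it possible to try the logic against a saved response before pointing it at a live server.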

4. Open the UI and create switches

Go to http://localhost:8080 and create your first switch.

Configuration

Vigil reads configuration from a YAML file. Copy the example and adjust for your environment:

cp vigil.yml.example vigil.yml

vigil.yml:

# Prometheus connection
prometheus_url: http://prometheus:9090
# prometheus_user: admin
# prometheus_password: secret

# Loki connection
loki_url: http://loki:3100
# loki_user: admin
# loki_password: secret

# Grafana (optional — for annotations on state changes)
# grafana_url: http://grafana:3000
# grafana_api_token: your-service-account-token

# Evaluation engine
eval_interval: 30s

# HTTP server
listen_addr: ":8080"

# SQLite database path
db_path: /data/vigil.db

The config file is looked up in this order: the path in the CONFIG_FILE environment variable, then ./vigil.yml, then /etc/vigil/vigil.yml.
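
The lookup order above can be sketched as follows. This is a hypothetical illustration rather than Vigil's own code: `find_config` is an invented name, and it assumes that a `CONFIG_FILE` path which does not exist simply falls through to the next candidate, which the documentation does not state.

```python
import os
from typing import Callable, Mapping, Optional

# Fallback locations, in the documented order.
DEFAULT_PATHS = ("./vigil.yml", "/etc/vigil/vigil.yml")


def find_config(
    env: Mapping[str, str] = os.environ,
    exists: Callable[[str], bool] = os.path.isfile,
) -> Optional[str]:
    """Return the first candidate path that exists, or None if there is none."""
    candidates = []
    if env.get("CONFIG_FILE"):
        candidates.append(env["CONFIG_FILE"])
    candidates.extend(DEFAULT_PATHS)
    for path in candidates:
        if exists(path):
            return path
    return None  # assumption: the caller then relies on defaults and env vars


print(find_config())
```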

Environment variables override the corresponding YAML values:

Variable	YAML key	Default	Description
`PROMETHEUS_URL`	`prometheus_url`	`http://prometheus:9090`	Prometheus server URL
`PROMETHEUS_USER`	`prometheus_user`	(empty)	Basic auth username for Prometheus
`PROMETHEUS_PASSWORD`	`prometheus_password`	(empty)	Basic auth password for Prometheus
`LOKI_URL`	`loki_url`	`http://loki:3100`	Loki server URL
`LOKI_USER`	`loki_user`	(empty)	Basic auth username for Loki
`LOKI_PASSWORD`	`loki_password`	(empty)	Basic auth password for Loki
`GRAFANA_URL`	`grafana_url`	(empty)	Grafana URL (optional, for annotations)
`GRAFANA_API_TOKEN`	`grafana_api_token`	(empty)	Grafana API token (optional)
`EVAL_INTERVAL`	`eval_interval`	`30s`	How often to evaluate all switches
`LISTEN_ADDR`	`listen_addr`	`:8080`	HTTP server listen address
`DB_PATH`	`db_path`	`/data/vigil.db`	SQLite database file path
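
As a rough sketch of that precedence (defaults, then YAML, then environment), the merge could look like the code below. The function name is hypothetical, the defaults are copied from the table, and treating an empty environment variable as "not set" is an assumption.

```python
import os
from typing import Mapping

# YAML keys and defaults from the table; the variable name is the key upper-cased.
DEFAULTS = {
    "prometheus_url": "http://prometheus:9090",
    "prometheus_user": "",
    "prometheus_password": "",
    "loki_url": "http://loki:3100",
    "loki_user": "",
    "loki_password": "",
    "grafana_url": "",
    "grafana_api_token": "",
    "eval_interval": "30s",
    "listen_addr": ":8080",
    "db_path": "/data/vigil.db",
}


def effective_config(yaml_values: Mapping[str, str],
                     env: Mapping[str, str] = os.environ) -> dict:
    """Merge defaults, YAML values, and environment overrides (highest priority last)."""
    merged = {**DEFAULTS, **yaml_values}
    for key in DEFAULTS:
        override = env.get(key.upper())
        if override:  # assumption: an empty variable does not override
            merged[key] = override
    return merged


cfg = effective_config({"loki_url": "http://loki.internal:3100"}, {"LISTEN_ADDR": ":8181"})
print(cfg["loki_url"], cfg["listen_addr"])
```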

Loki Endpoints Required

If Loki sits behind a reverse proxy such as nginx, the proxy must expose these endpoints to Vigil:

# Required
location /loki/api/v1/query { proxy_pass http://loki:3100; }
location /loki/api/v1/query_range { proxy_pass http://loki:3100; }

# Optional (for auto-discovery)
location /loki/api/v1/patterns { proxy_pass http://loki:3100; }
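
To see what has to pass through the proxy, the sketch below builds a `query_range` request URL using Loki's documented `query`, `start`, `end`, `limit`, and `direction` parameters. The helper name is hypothetical, and the exact parameters Vigil itself sends are not described here.

```python
import urllib.parse


def query_range_url(base_url: str, logql: str, start_ns: int, end_ns: int,
                    limit: int = 100) -> str:
    """Build a Loki query_range URL; start and end are Unix epochs in nanoseconds."""
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


print(query_range_url(
    "http://loki:3100",
    '{job="myapp"} |= "batch processing complete"',
    1_700_000_000_000_000_000,
    1_700_003_600_000_000_000,
))
```

Opening the printed URL through the proxy (for example with curl) is a quick way to confirm that the `location` blocks are in place.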

Detection Modes

Frequency Mode

For signals expected at a known interval. Configure:

Interval: expected every N seconds (e.g., 3600 = every hour)
Grace period: extra time before alerting (e.g., 300 = 5 min)
Time window (optional): only monitor during specific hours (e.g., 09:00-17:00)

Prometheus example — watch a gauge that stores a unix timestamp:

Query:    cron_last_run_timestamp{cron_name="sync_awb"}
Mode:     frequency
Interval: 3600   (every 1 hour)
Grace:    300    (5 min grace)

Loki example — watch for a specific log line:

Query:    {job="diagonAlleyBE_prod"} |= "[CRON] sync_awb completed"
Mode:     frequency
Interval: 3600
Grace:    300
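
A minimal sketch of how a frequency-mode check might be evaluated, assuming the state depends only on the age of the last signal. The function names are hypothetical, and the handling of time windows that cross midnight is an assumption, not documented behavior.

```python
from datetime import datetime, time
from typing import Optional, Tuple


def frequency_state(last_signal: float, now: float, interval: int, grace: int) -> str:
    """Classify a switch from the age of its last signal (all values in seconds)."""
    elapsed = now - last_signal
    if elapsed <= interval:
        return "UP"
    if elapsed <= interval + grace:
        return "GRACE"
    return "DOWN"


def in_time_window(moment: datetime, window: Optional[Tuple[time, time]]) -> bool:
    """True when monitoring applies; no window means the switch is always monitored."""
    if window is None:
        return True
    start, end = window
    current = moment.time()
    if start <= end:
        return start <= current <= end
    return current >= start or current <= end  # window wraps past midnight


# Interval 3600 and grace 300, as in the examples above.
print(frequency_state(last_signal=0, now=3700, interval=3600, grace=300))
```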

Irregularity Mode

For signals that occur at irregular but roughly predictable intervals. Vigil learns the pattern from historical data and alerts when the signal is overdue.

Min samples: how many data points to collect before activating (default: 4)
Tolerance multiplier: the multiple of the median interval that may elapse before alerting (default: 2.0)

Query:         {job="myapp"} |= "batch processing complete"
Mode:          irregularity
Min samples:   4
Tolerance:     2.0

Vigil computes the median interval between occurrences and alerts when the time since the last occurrence exceeds tolerance * median.
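
The rule can be illustrated with a short sketch. The function name is hypothetical, and it assumes that "samples" means signal timestamps, so four samples yield three intervals. It returns None while still learning.

```python
from statistics import median
from typing import Optional, Sequence


def is_overdue(timestamps: Sequence[float], now: float,
               min_samples: int = 4, tolerance: float = 2.0) -> Optional[bool]:
    """Return None while learning, otherwise whether the signal is overdue."""
    if len(timestamps) < max(min_samples, 2):
        return None  # not enough data points to derive an interval
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    elapsed = now - ordered[-1]
    return elapsed > tolerance * median(intervals)


# Intervals are 100, 120, and 80 seconds, so the median is 100 and the threshold 200.
seen = [0, 100, 220, 300]
print(is_overdue(seen, now=450))
print(is_overdue(seen, now=501))
```

Using the median rather than the mean keeps a single unusually long gap from inflating the threshold.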

Switch States

    NEW ──── first signal ──── UP
                                │
                      signal    │  no signal within
                      arrives   │  expected window
                        │       │
                        │       ▼
                        └──── GRACE
                                │
                      signal    │  grace period
                      arrives   │  expires
                        │       │
                        ▼       ▼
                       UP     DOWN ── signal arrives ── UP

    LEARNING:  Irregularity mode — collecting initial data points.
    PAUSED:    Manually paused. No evaluation.
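
The transitions in the diagram can be written as a small lookup table. This is a sketch only: the event names (`signal`, `window_missed`, `grace_expired`) are hypothetical labels for the diagram's arrows, and LEARNING and PAUSED are not modeled.

```python
# (current state, event) -> next state, following the diagram above.
TRANSITIONS = {
    ("NEW", "signal"): "UP",
    ("UP", "window_missed"): "GRACE",
    ("GRACE", "signal"): "UP",
    ("GRACE", "grace_expired"): "DOWN",
    ("DOWN", "signal"): "UP",
}


def next_state(state: str, event: str) -> str:
    """Return the next state; events without a listed transition leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


state = "NEW"
for event in ["signal", "window_missed", "grace_expired", "signal"]:
    state = next_state(state, event)
    print(event, "->", state)
```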

Exposed Prometheus Metrics

Metric	Type	Labels	Description
`dms_switch_status`	gauge	name, mode, signal	1 = healthy, 0 = violated
`dms_last_signal_timestamp`	gauge	name	Unix timestamp of last signal
`dms_expected_at_timestamp`	gauge	name	Unix timestamp of next expected signal
`dms_state_duration_seconds`	gauge	name, state	Seconds in current state
`dms_eval_total`	counter	name, result	Evaluation count (pass/fail)

When auto-discovery is configured, Vigil scans Loki every hour for patterns matching [CRON]* in the specified job and auto-creates irregularity-mode switches for any recurring patterns it finds.

Docker

# Pull from Docker Hub
docker pull shubhankarmohan/vigil:latest

# Run standalone
docker run -d \
  --name vigil \
  -p 8080:8080 \
  -v vigil-data:/data \
  -v "$(pwd)/vigil.yml:/etc/vigil/vigil.yml:ro" \
  shubhankarmohan/vigil:latest

To build from source instead:

docker build -t vigil .

Development

# Prerequisites: Go 1.23+, Node 18+

# Run backend
DB_PATH=./vigil.db \
PROMETHEUS_URL=https://metrics.example.com \
PROMETHEUS_USER=admin \
PROMETHEUS_PASSWORD=secret \
LOKI_URL=https://logs.example.com \
LOKI_USER=admin \
LOKI_PASSWORD=secret \
LISTEN_ADDR=:8181 \
go run ./cmd/vigil

# Run frontend (separate terminal)
cd web
npm install
npm run dev
# Opens at http://localhost:5173, proxies API to :8181
