Automated snapshot creation service for Cosmos SDK chains with built-in monitoring and beautiful web UI.
This service automates the process of creating chain snapshots for Cosmos SDK nodes:
- Starts a backup node
- Waits for it to sync
- Stops the node
- Prunes the database (optional)
- Creates a compressed snapshot
- Cleans up old snapshots
- Repeats on configured interval
Key Features:
- ✅ Multi-chain support - One Docker image for all chains (Injective, Osmosis, Cosmos Hub, Juno, etc.)
- ✅ Beautiful web UI - Modern landing page to browse and download snapshots
- ✅ Full monitoring via Prometheus metrics
- ✅ Health endpoints for service checks
- ✅ Zabbix integration with smart alerting
- ✅ Flag-based config - Command-line flags with env var fallback
- ✅ Auto-organized backups - Snapshots automatically sectioned by chain
- ✅ HashiCorp Nomad ready with Consul integration
- ✅ Structured logging (JSON or console)
```text
┌───────────────────────────────────────────────────────────────┐
│                       One Docker Image                        │
│  ┌──────────┬──────────┬──────────┬──────────┬──────────┐     │
│  │injectived│ osmosisd │  gaiad   │  junod   │cosmprund │     │
│  └──────────┴──────────┴──────────┴──────────┴──────────┘     │
│  ┌───────────────────────────────────────────────────────┐    │
│  │            Snapshot Service (Go Binary)               │    │
│  │  • Start node → Sync → Stop → Prune → Backup          │    │
│  │  • HTTP Health & Metrics                              │    │
│  │  • Auto-organize by chain                             │    │
│  └───────────────────────────────────────────────────────┘    │
└───────────────────────────────────────────────────────────────┘
                               │
                               ▼
              ┌─────────────────────────────────────┐
              │      Shared Volume: /backups/       │
              │                                     │
              │  ├── injective-1/                   │
              │  │   └── snapshots.tar.gz           │
              │  ├── osmosis-1/                     │
              │  │   └── snapshots.tar.gz           │
              │  └── cosmoshub-4/                   │
              │      └── snapshots.tar.gz           │
              └─────────────────────────────────────┘
                               │
                               ▼
              ┌─────────────────────────────────────┐
              │      Nginx (Beautiful Web UI)       │
              │                                     │
              │   🚀 Cosmos Snapshot Service        │
              │   ┌──────┐ ┌──────┐ ┌──────┐        │
              │   │ INJ  │ │ OSMO │ │ ATOM │        │
              │   └──────┘ └──────┘ └──────┘        │
              └─────────────────────────────────────┘
```
Deployment Model:
- Per Chain: One job/container instance per blockchain
- Shared: Single nginx serves all chains via one web interface
- Scalable: Add new chains by deploying new instances (same image!)
Problems with the bash script:
- ❌ No way to monitor service health
- ❌ Can't distinguish "creating snapshot" from "crashed"
- ❌ Zabbix templates alert when node stops (which is expected!)
- ❌ All errors silenced (`|| true` everywhere)
- ❌ No metrics on snapshot size, duration, success rate
This service solves:
- ✅ Dedicated health endpoint - know if service is running
- ✅ Separate metrics for node vs snapshot service
- ✅ Zabbix template that understands backup workflow
- ✅ Proper error handling and reporting
- ✅ Detailed metrics and observability
The Docker image includes binaries for multiple chains:
```bash
docker build -t cosmos-snapshot-service:latest .
```

Included chains:
- Injective (injectived)
- Osmosis (osmosisd)
- Cosmos Hub (gaiad)
- Juno (junod)
Using flags (recommended):
```bash
snapshot-service \
  --chain-id injective-1 \
  --binary injectived \
  --home-dir /data/.injectived
```

Using environment variables:

```bash
export CHAIN_ID=injective-1
export BINARY=injectived
export HOME_DIR=/data/.injectived
snapshot-service
```

Using Docker:

```bash
docker run -v /data:/data \
  cosmos-snapshot-service:latest \
  --chain-id injective-1 \
  --binary injectived \
  --home-dir /data/.injectived
```

Use the included snapshot.nomad job file:

```bash
nomad job run snapshot.nomad
```

The Nomad job includes:
- snapshot-service task - Creates snapshots automatically
- nginx task - Serves snapshots via beautiful web UI
Navigate to http://your-host/ to see the snapshot landing page with all available chains.
- Open Zabbix UI → Data collection → Templates
- Click Import
- Upload `zabbix_template_snapshot.yaml`
- Create host with this template
- Set macros: `{HOST.IP}`, `{$HEALTH_PORT}`, `{$METRICS_PORT}`
```bash
# Check health
curl http://localhost:8080/health

# Check status
curl http://localhost:8080/status

# Check metrics
curl http://localhost:9090/metrics

# Browse snapshots (if nginx is running)
curl http://localhost/
```

Configuration is via command-line flags (preferred) or environment variables (fallback).

Priority: Flags > Environment Variables > Defaults

Run `snapshot-service --help` for full documentation.
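To make the precedence rule concrete, here is an illustrative sketch (not the service's actual code) of how "flags > environment variables > defaults" resolution behaves; `resolve_config` is a hypothetical helper:

```shell
# Hypothetical helper: return the flag value if set, else the env var,
# else the built-in default -- mirroring the documented precedence.
resolve_config() {
  flag_value="$1"; env_value="$2"; default_value="$3"
  if [ -n "$flag_value" ]; then
    echo "$flag_value"          # a flag always wins
  elif [ -n "$env_value" ]; then
    echo "$env_value"           # otherwise fall back to the env var
  else
    echo "$default_value"       # otherwise use the default
  fi
}

CHAIN_ID=osmosis-1  # simulated environment variable

resolve_config "injective-1" "$CHAIN_ID" "cosmoshub-4"   # prints injective-1
resolve_config ""            "$CHAIN_ID" "cosmoshub-4"   # prints osmosis-1
resolve_config ""            ""          "cosmoshub-4"   # prints cosmoshub-4
```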
Required:

```text
--chain-id string    Chain identifier (e.g., injective-1)
--binary string      Chain binary name (e.g., injectived)
--home-dir string    Chain home directory
```

Snapshot Configuration:

```text
--snapshot-interval duration   Snapshot interval (default: 24h)
--backup-dir string            Backup directory (default: {HOME_DIR}/backups/{CHAIN_ID})
--retention-count int          Number of snapshots to keep (default: 2)
--prune                        Run cosmprund before backup (default: true)
```

Node Configuration:

```text
--rpc-endpoint string     RPC endpoint (default: http://localhost:26657)
--sync-timeout duration   Max sync wait time (default: 24h)
```

Monitoring:

```text
--health-port string     Health port (default: 8080)
--metrics-port string    Metrics port (default: 9090)
--zabbix-server string   Zabbix server address (optional)
--zabbix-host string     Zabbix hostname (optional)
```

Logging:

```text
--log-level string    Level: debug, info, warn, error (default: info)
--log-format string   Format: json, console (default: json)
```

All flags map to environment variables (uppercase with underscores):
| Flag | Environment Variable | Example |
|---|---|---|
| `--chain-id` | `CHAIN_ID` | `injective-1` |
| `--binary` | `BINARY` | `injectived` |
| `--home-dir` | `HOME_DIR` | `/data/.injectived` |
| `--snapshot-interval` | `SNAPSHOT_INTERVAL` | `24h` |
| `--backup-dir` | `BACKUP_DIR` | `/backups/injective-1` |
| `--retention-count` | `RETENTION_COUNT` | `3` |
| `--prune` | `PRUNE_BEFORE_BACKUP` | `true` |
| `--rpc-endpoint` | `RPC_ENDPOINT` | `http://localhost:26657` |
| `--sync-timeout` | `SYNC_TIMEOUT` | `24h` |
| `--health-port` | `HEALTH_PORT` | `8080` |
| `--metrics-port` | `METRICS_PORT` | `9090` |
| `--log-level` | `LOG_LEVEL` | `info` |
| `--log-format` | `LOG_FORMAT` | `json` |
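The retention behavior implied by `--retention-count` / `RETENTION_COUNT` can be sketched as follows; this is an illustration of the "keep the newest N archives" idea, not the service's implementation, and the filenames are dummies:

```shell
# Sketch of retention cleanup: keep only the newest $RETENTION_COUNT
# .tar.gz archives in the backup directory, delete the rest.
BACKUP_DIR=$(mktemp -d)
RETENTION_COUNT=2

# Create four dummy snapshots with increasing mtimes (oldest first).
for i in 1 2 3 4; do
  touch -t "20240115010$i" "$BACKUP_DIR/demo-1-$i.tar.gz"
done

# List newest-first, skip the first $RETENTION_COUNT, remove the remainder.
ls -1t "$BACKUP_DIR"/*.tar.gz | tail -n +$((RETENTION_COUNT + 1)) | while read -r old; do
  rm -f "$old"
done

ls "$BACKUP_DIR"   # only the two newest archives remain
```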
Snapshots are automatically organized by chain:
```text
/data/.injectived/
└── backups/
    └── injective-1/
        ├── injective-1-12345-2024-01-15T10-00-00.tar.gz
        └── injective-1-12346-2024-01-16T10-00-00.tar.gz

/data/.osmosisd/
└── backups/
    └── osmosis-1/
        └── osmosis-1-98765-2024-01-15T10-00-00.tar.gz
```
This allows a single nginx instance to serve all chains from one directory structure.
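The archive names above follow a `{CHAIN_ID}-{HEIGHT}-{TIMESTAMP}.tar.gz` pattern. A sketch of how such a name could be composed (`BLOCK_HEIGHT` here is a hard-coded stand-in for the node's actual height):

```shell
# Compose a snapshot filename from chain ID, block height, and a UTC
# timestamp with ':' replaced by '-' so it is filesystem-safe.
CHAIN_ID=injective-1
BLOCK_HEIGHT=12345
TIMESTAMP=$(date -u +%Y-%m-%dT%H-%M-%S)

SNAPSHOT_NAME="${CHAIN_ID}-${BLOCK_HEIGHT}-${TIMESTAMP}.tar.gz"
echo "$SNAPSHOT_NAME"
```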
`GET /health` - Simple health check

```json
{
  "status": "healthy"
}
```

`GET /status` - Detailed status

```json
{
  "status": "healthy",
  "chain_id": "injective-1",
  "last_backup_time": "2024-01-15T10:30:00Z",
  "node_running": false,
  "creating_backup": false,
  "next_backup_in": "20h15m30s",
  "snapshot_interval": "24h"
}
```

Available at http://localhost:9090/metrics:
| Metric | Type | Description |
|---|---|---|
| `snapshot_success_total` | Counter | Total successful snapshots |
| `snapshot_error_total` | Counter | Total failed snapshots |
| `snapshot_last_success_timestamp` | Gauge | Unix timestamp of last success |
| `snapshot_duration_seconds` | Histogram | Time to create snapshot |
| `snapshot_size_megabytes` | Gauge | Size of last snapshot (MB) |
| `snapshot_service_health_status` | Gauge | Service health (1=healthy, 0=unhealthy) |
| `snapshot_node_running` | Gauge | Node currently running (1=yes, 0=no) |
| `snapshot_backup_creating` | Gauge | Backup being created (1=yes, 0=no) |
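These gauges can be consumed by any Prometheus scraper, but they are also easy to read ad hoc. A small sketch of pulling the health gauge out of a metrics payload; the sample text below is hand-written, and a live check would pipe `curl -s http://localhost:9090/metrics` instead:

```shell
# Extract snapshot_service_health_status from a (sample) /metrics payload.
metrics='snapshot_success_total 42
snapshot_service_health_status 1
snapshot_node_running 0'

health=$(printf '%s\n' "$metrics" | awk '$1 == "snapshot_service_health_status" {print $2}')
if [ "$health" = "1" ]; then
  echo "service healthy"
else
  echo "service unhealthy"
fi
```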
The included Zabbix template monitors:
✅ What it monitors:
- Snapshot creation success/failure
- Time since last successful snapshot
- Service health (separate from node health!)
- Snapshot size trends

🎯 Smart alerting:
- ✅ Alerts if no snapshot in 24-48 hours
- ✅ Alerts if snapshot creation fails
- ✅ Alerts if service becomes unhealthy
- ✅ Does NOT alert when node stops (expected behavior!)

Key difference from normal node monitoring:
- Normal template: Alerts when node stops ❌
- This template: Understands node stops for backups ✅
```text
[Service Running] ───────────────────────────> Time
    │
    ├─> Start Node ──────────────────> Node Running
    │
    ├─> Wait for Sync (polling /status)
    │
    ├─> Stop Node ───────────────────> Node Stopped (EXPECTED!)
    │
    ├─> Prune Database (cosmprund)
    │
    ├─> Create Backup (tar.gz) ──────> Backup Creating
    │
    ├─> Cleanup Old Backups
    │
    ├─> Update Metrics
    │
    └─> Wait {SNAPSHOT_INTERVAL} ────> Repeat
```
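The "Wait for Sync" step above polls the node's Tendermint RPC `/status` endpoint, which reports `sync_info.catching_up`. A minimal sketch of that check; the JSON below is a trimmed hand-written sample, whereas a live check would use `curl -s http://localhost:26657/status`:

```shell
# Decide whether the node is synced by inspecting catching_up in a
# (sample) /status response.
status_json='{"result":{"sync_info":{"latest_block_height":"12345","catching_up":false}}}'

if printf '%s' "$status_json" | grep -q '"catching_up":false'; then
  echo "node is synced"
else
  echo "still catching up"
fi
```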
| State | Node Running | Creating Backup | Health |
|---|---|---|---|
| Syncing | ✅ Yes | ❌ No | ✅ Healthy |
| Backing up | ❌ No | ✅ Yes | ✅ Healthy |
| Waiting | ❌ No | ❌ No | ✅ Healthy |
| Failed | ❌ No | ❌ No | ❌ Unhealthy |
Service is unhealthy if:
- Last snapshot failed
- No snapshot in 2x interval (e.g., 48h if interval is 24h)
Service is healthy even when:
- Node is stopped (normal during backup)
- Waiting between snapshots
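The "no snapshot in 2x interval" rule can be sketched with plain epoch arithmetic; in practice `last_success` would come from the `snapshot_last_success_timestamp` gauge rather than being simulated as below:

```shell
# Staleness check: unhealthy if the last successful snapshot is older
# than twice the configured interval.
interval_seconds=$((24 * 3600))      # 24h snapshot interval
now=$(date +%s)
last_success=$((now - 3600))         # pretend the last snapshot was 1h ago

age=$((now - last_success))
if [ "$age" -gt $((2 * interval_seconds)) ]; then
  echo "unhealthy: last snapshot ${age}s ago"
else
  echo "healthy"
fi
```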
| Aspect | Bash Script | This Service |
|---|---|---|
| Monitoring | Zabbix sender only | Full Prometheus + Zabbix |
| Health Check | None | HTTP endpoints |
| Error Handling | Silenced (`\|\| true`) | Proper errors + metrics |
| Status Visibility | None | Real-time status endpoint |
| Alerting | Node stop = alert ❌ | Smart alerting ✅ |
| Metrics | None | Size, duration, success rate |
| Logs | Plain text | Structured JSON |
| Service Health | Can't tell if crashed | Monitored continuously |
One Docker image supports all chains:
- The Docker image contains binaries for Injective, Osmosis, Cosmos Hub, Juno, etc.
- Each chain deployment is a separate job/instance
- All share the same nginx web UI
Recommended setup (Nomad):
```text
Job: injective-snapshot  → Uses injectived binary
Job: osmosis-snapshot    → Uses osmosisd binary
Job: nginx-snapshots     → Serves all chains via web UI
```

1. Deploy Injective:

```bash
nomad job run injective-snapshot.nomad
```

2. Deploy Osmosis:

```bash
nomad job run osmosis-snapshot.nomad
```

3. Deploy Nginx (shared):

```bash
nomad job run nginx-snapshots.nomad
```

Each chain writes to its own subdirectory automatically, and nginx serves them all from one beautiful landing page.
```text
/shared/backups/
├── injective-1/
│   ├── injective-1-12345-2024-01-15.tar.gz
│   └── injective-1-12346-2024-01-16.tar.gz
├── osmosis-1/
│   ├── osmosis-1-98765-2024-01-15.tar.gz
│   └── osmosis-1-98766-2024-01-16.tar.gz
└── cosmoshub-4/
    └── cosmoshub-4-45678-2024-01-15.tar.gz
```
The web UI at http://your-host/ automatically discovers and displays all available chains!
Use the included snapshot.nomad file:
- Configs from Consul KV or flags
- Automatic restart on failure
- Service discovery
- Health checks
- Shared nginx for web UI
Check:

```bash
curl http://localhost:8080/status
```

Look for the `last_backup_error` field.
Common causes:
- Node won't sync (check RPC endpoint)
- Disk space full (check backup directory)
- cosmprund not installed
- Permissions issues
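For the "disk space full" failure mode, a hypothetical pre-flight check (not part of the service) would verify free space in the backup directory before starting a snapshot; `MIN_FREE_MB` is an illustrative threshold:

```shell
# Pre-flight check: measure free space in the backup filesystem (in MB)
# and refuse to proceed below a minimum threshold.
BACKUP_DIR=/tmp
MIN_FREE_MB=10

free_kb=$(df -Pk "$BACKUP_DIR" | awk 'NR==2 {print $4}')
free_mb=$((free_kb / 1024))
if [ "$free_mb" -ge "$MIN_FREE_MB" ]; then
  echo "ok: ${free_mb}MB free in $BACKUP_DIR"
else
  echo "abort: only ${free_mb}MB free in $BACKUP_DIR"
fi
```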
Check logs:

```bash
# Nomad
nomad alloc logs <alloc-id>

# Docker
docker logs <container-id>

# Systemd
journalctl -u cosmos-snapshot -f
```

Verify:
- Binary path correct (`which injectived`)
- Home directory accessible
- Backup directory writable
Verify Zabbix config:

```bash
# Test zabbix_sender manually
zabbix_sender -z <zabbix-server> -s <host> -k snapshot.injective-1.status -o 1
```

Check:
- `ZABBIX_SERVER` and `ZABBIX_HOST` set correctly
- zabbix_sender installed in container
- Firewall allows connection to Zabbix server
```bash
go mod download
go build -o snapshot-service .
```

Using flags (recommended):
```bash
./snapshot-service \
  --chain-id test \
  --binary echo \
  --home-dir /tmp/test \
  --snapshot-interval 5m \
  --log-format console \
  --log-level debug
```

Using environment variables:
```bash
export CHAIN_ID=test
export BINARY=echo  # Use echo for testing without a real node
export HOME_DIR=/tmp/test
export SNAPSHOT_INTERVAL=5m
export LOG_FORMAT=console
./snapshot-service
```

Set `--binary echo` (or `BINARY=echo`) and use `/tmp/test` to exercise the service workflow without a real blockchain node.
```bash
# View all available flags
./snapshot-service --help

# Test help during development
go run . --help
```

The bash script this service replaces:

```bash
#!/bin/bash
while true; do
  injectived start &
  PID=$!
  # wait for sync...
  kill $PID
  cosmprund prune /data
  tar -czf backup.tar.gz /data
  sleep 86400
done
```

Just deploy and configure. Same workflow, but with:
- ✅ Monitoring
- ✅ Proper error handling
- ✅ Health checks
- ✅ Metrics
- ✅ Smart Zabbix alerting
MIT
For issues or questions:
- Check logs first
- Review metrics/health endpoints
- Verify configuration
- Test with `LOG_LEVEL=debug`