dpdanpittman/cosmos-node-snapshots

Cosmos Snapshot Service

Automated snapshot creation service for Cosmos SDK chains, with built-in monitoring and a beautiful web UI.

Overview

This service automates the process of creating chain snapshots for Cosmos SDK nodes:

  1. Starts a backup node
  2. Waits for it to sync
  3. Stops the node
  4. Prunes the database (optional)
  5. Creates a compressed snapshot
  6. Cleans up old snapshots
  7. Repeats on configured interval
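The loop above can be sketched in Go. The step names and function signatures here are illustrative, not the service's actual API:

```go
package main

import (
	"log"
	"time"
)

// snapshotSteps returns the ordered steps of one backup cycle,
// matching the workflow above; pruning is optional.
func snapshotSteps(prune bool) []string {
	steps := []string{"start-node", "wait-for-sync", "stop-node"}
	if prune {
		steps = append(steps, "prune")
	}
	return append(steps, "create-snapshot", "cleanup-old")
}

// runLoop executes the cycle forever, sleeping the configured
// interval between cycles; run is a placeholder for step execution.
func runLoop(interval time.Duration, prune bool, run func(step string) error) {
	for {
		for _, s := range snapshotSteps(prune) {
			if err := run(s); err != nil {
				log.Printf("step %q failed: %v", s, err)
				break // record the failure; retry next cycle
			}
		}
		time.Sleep(interval)
	}
}
```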

Key Features:

  • βœ… Multi-chain support - One Docker image for all chains (Injective, Osmosis, Cosmos Hub, Juno, etc.)
  • βœ… Beautiful web UI - Modern landing page to browse and download snapshots
  • βœ… Full monitoring via Prometheus metrics
  • βœ… Health endpoints for service checks
  • βœ… Zabbix integration with smart alerting
  • βœ… Flag-based config - Command-line flags with env var fallback
  • βœ… Auto-organized backups - Snapshots automatically sectioned by chain
  • βœ… HashiCorp Nomad ready with Consul integration
  • βœ… Structured logging (JSON or console)

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    One Docker Image                         │
│  ┌──────────┬──────────┬──────────┬──────────┬──────────┐   │
│  │injectived│ osmosisd │  gaiad   │  junod   │cosmprund │   │
│  └──────────┴──────────┴──────────┴──────────┴──────────┘   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │         Snapshot Service (Go Binary)                │    │
│  │  • Start node → Sync → Stop → Prune → Backup        │    │
│  │  • HTTP Health & Metrics                            │    │
│  │  • Auto-organize by chain                           │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────────┐
         │   Shared Volume: /backups/           │
         │                                      │
         │   ├── injective-1/                   │
         │   │   └── snapshots.tar.gz           │
         │   ├── osmosis-1/                     │
         │   │   └── snapshots.tar.gz           │
         │   └── cosmoshub-4/                   │
         │       └── snapshots.tar.gz           │
         └──────────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────────┐
         │    Nginx (Beautiful Web UI)          │
         │                                      │
         │  🚀 Cosmos Snapshot Service          │
         │  ┌──────┐ ┌──────┐ ┌──────┐          │
         │  │ INJ  │ │ OSMO │ │ ATOM │          │
         │  └──────┘ └──────┘ └──────┘          │
         └──────────────────────────────────────┘

Deployment Model:

  • Per Chain: One job/container instance per blockchain
  • Shared: Single nginx serves all chains via one web interface
  • Scalable: Add new chains by deploying new instances (same image!)

Why This Over Bash Script?

Problems with bash script:

  • ❌ No way to monitor service health
  • ❌ Can't distinguish "creating snapshot" from "crashed"
  • ❌ Zabbix templates alert when node stops (which is expected!)
  • ❌ All errors silenced (|| true everywhere)
  • ❌ No metrics on snapshot size, duration, success rate

This service solves:

  • ✅ Dedicated health endpoint - know if the service is running
  • ✅ Separate metrics for node vs snapshot service
  • ✅ Zabbix template that understands the backup workflow
  • ✅ Proper error handling and reporting
  • ✅ Detailed metrics and observability

Quick Start

1. Build Docker Image

The Docker image includes binaries for multiple chains:

docker build -t cosmos-snapshot-service:latest .

Included chains:

  • Injective (injectived)
  • Osmosis (osmosisd)
  • Cosmos Hub (gaiad)
  • Juno (junod)

2. Run the Service

Using flags (recommended):

snapshot-service \
  --chain-id injective-1 \
  --binary injectived \
  --home-dir /data/.injectived

Using environment variables:

export CHAIN_ID=injective-1
export BINARY=injectived
export HOME_DIR=/data/.injectived
snapshot-service

Using Docker:

docker run -v /data:/data \
  cosmos-snapshot-service:latest \
  --chain-id injective-1 \
  --binary injectived \
  --home-dir /data/.injectived

3. Deploy with Nomad

Use the included snapshot.nomad job file:

nomad job run snapshot.nomad

The Nomad job includes:

  • snapshot-service task - Creates snapshots automatically
  • nginx task - Serves snapshots via beautiful web UI

4. Access the Web UI

Navigate to http://your-host/ to see the snapshot landing page with all available chains.

5. Import Zabbix Template

  1. Open Zabbix UI → Data collection → Templates
  2. Click Import
  3. Upload zabbix_template_snapshot.yaml
  4. Create host with this template
  5. Set macros: {HOST.IP}, {$HEALTH_PORT}, {$METRICS_PORT}

6. Verify

# Check health
curl http://localhost:8080/health

# Check status
curl http://localhost:8080/status

# Check metrics
curl http://localhost:9090/metrics

# Browse snapshots (if nginx is running)
curl http://localhost/

Configuration

Configuration via command-line flags (preferred) or environment variables (fallback).

Priority: Flags > Environment Variables > Defaults

Command-Line Flags

Run snapshot-service --help for full documentation.

Required:

--chain-id string        Chain identifier (e.g., injective-1)
--binary string          Chain binary name (e.g., injectived)
--home-dir string        Chain home directory

Snapshot Configuration:

--snapshot-interval duration   Snapshot interval (default: 24h)
--backup-dir string           Backup directory (default: {HOME_DIR}/backups/{CHAIN_ID})
--retention-count int         Number of snapshots to keep (default: 2)
--prune                       Run cosmprund before backup (default: true)

Node Configuration:

--rpc-endpoint string    RPC endpoint (default: http://localhost:26657)
--sync-timeout duration  Max sync wait time (default: 24h)
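A sketch of the sync wait these two flags control, assuming the standard Tendermint `/status` RPC shape (`result.sync_info.catching_up`); `waitForSync` and the 30-second poll interval are illustrative choices, not the service's actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// statusResponse models the subset of the Tendermint /status
// reply needed to detect sync completion.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			CatchingUp bool `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

// caughtUp reports whether a /status body says the node has
// finished catching up.
func caughtUp(body []byte) (bool, error) {
	var s statusResponse
	if err := json.Unmarshal(body, &s); err != nil {
		return false, err
	}
	return !s.Result.SyncInfo.CatchingUp, nil
}

// waitForSync polls the RPC endpoint until the node reports it is
// no longer catching up, or the timeout elapses.
func waitForSync(rpc string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(rpc + "/status")
		if err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if ok, err := caughtUp(body); err == nil && ok {
				return nil
			}
		}
		time.Sleep(30 * time.Second)
	}
	return fmt.Errorf("node did not sync within %s", timeout)
}
```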

Monitoring:

--health-port string     Health port (default: 8080)
--metrics-port string    Metrics port (default: 9090)
--zabbix-server string   Zabbix server address (optional)
--zabbix-host string     Zabbix hostname (optional)

Logging:

--log-level string    Level: debug, info, warn, error (default: info)
--log-format string   Format: json, console (default: json)

Environment Variables

All flags map to environment variables (uppercase with underscores):

Flag                  Environment Variable   Example
--chain-id            CHAIN_ID               injective-1
--binary              BINARY                 injectived
--home-dir            HOME_DIR               /data/.injectived
--snapshot-interval   SNAPSHOT_INTERVAL      24h
--backup-dir          BACKUP_DIR             /backups/injective-1
--retention-count     RETENTION_COUNT        3
--prune               PRUNE_BEFORE_BACKUP    true
--rpc-endpoint        RPC_ENDPOINT           http://localhost:26657
--sync-timeout        SYNC_TIMEOUT           24h
--health-port         HEALTH_PORT            8080
--metrics-port        METRICS_PORT           9090
--log-level           LOG_LEVEL              info
--log-format          LOG_FORMAT             json

Backup Directory Structure

Snapshots are automatically organized by chain:

/data/.injectived/
  └── backups/
      └── injective-1/
          ├── injective-1-12345-2024-01-15T10-00-00.tar.gz
          └── injective-1-12346-2024-01-16T10-00-00.tar.gz

/data/.osmosisd/
  └── backups/
      └── osmosis-1/
          └── osmosis-1-98765-2024-01-15T10-00-00.tar.gz

This allows a single nginx instance to serve all chains from one directory structure.

Monitoring

Health Endpoints

GET /health - Simple health check

{
  "status": "healthy"
}

GET /status - Detailed status

{
  "status": "healthy",
  "chain_id": "injective-1",
  "last_backup_time": "2024-01-15T10:30:00Z",
  "node_running": false,
  "creating_backup": false,
  "next_backup_in": "20h15m30s",
  "snapshot_interval": "24h"
}

Prometheus Metrics

Available at http://localhost:9090/metrics:

Metric                            Type       Description
snapshot_success_total            Counter    Total successful snapshots
snapshot_error_total              Counter    Total failed snapshots
snapshot_last_success_timestamp   Gauge      Unix timestamp of last success
snapshot_duration_seconds         Histogram  Time to create snapshot
snapshot_size_megabytes           Gauge      Size of last snapshot (MB)
snapshot_service_health_status    Gauge      Service health (1=healthy, 0=unhealthy)
snapshot_node_running             Gauge      Node currently running (1=yes, 0=no)
snapshot_backup_creating          Gauge      Backup being created (1=yes, 0=no)

Zabbix Monitoring

The included Zabbix template monitors:

✅ What it monitors:

  • Snapshot creation success/failure
  • Time since last successful snapshot
  • Service health (separate from node health!)
  • Snapshot size trends

🎯 Smart alerting:

  • ✅ Alerts if no snapshot in 24-48 hours
  • ✅ Alerts if snapshot creation fails
  • ✅ Alerts if service becomes unhealthy
  • ❌ Does NOT alert when node stops (expected behavior!)

Key difference from normal node monitoring:

  • Normal template: Alerts when node stops ❌
  • This template: Understands node stops for backups ✅

How It Works

Normal Operation

[Service Running] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━> Time
    │
    ├─> Start Node ━━━━━━━━━━━━━━━━━━━━━> Node Running
    │
    ├─> Wait for Sync (polling /status)
    │
    ├─> Stop Node ━━━━━━━━━━━━━━━━━━━━━> Node Stopped (EXPECTED!)
    │
    ├─> Prune Database (cosmprund)
    │
    ├─> Create Backup (tar.gz) ━━━━━━━> Backup Creating
    │
    ├─> Cleanup Old Backups
    │
    ├─> Update Metrics
    │
    └─> Wait {SNAPSHOT_INTERVAL} ━━━━━> Repeat

Health States

State       Node Running   Creating Backup   Health
Syncing     ✅ Yes         ❌ No             ✅ Healthy
Backing up  ❌ No          ✅ Yes            ✅ Healthy
Waiting     ❌ No          ❌ No             ✅ Healthy
Failed      ❌ No          ❌ No             ❌ Unhealthy

Service is unhealthy if:

  • Last snapshot failed
  • No snapshot in 2x interval (e.g., 48h if interval is 24h)

Service is healthy even when:

  • Node is stopped (normal during backup)
  • Waiting between snapshots

Comparison: Bash vs Go Service

Aspect              Bash Script             This Service
Monitoring          Zabbix sender only      Full Prometheus + Zabbix
Health Check        None                    HTTP endpoints
Error Handling      Silenced (|| true)      Proper errors + metrics
Status Visibility   None                    Real-time status endpoint
Alerting            Node stop = alert ❌    Smart alerting ✅
Metrics             None                    Size, duration, success rate
Logs                Plain text              Structured JSON
Service Health      Can't tell if crashed   Monitored continuously

Multi-Chain Deployment

Architecture

One Docker image supports all chains:

  • The Docker image contains binaries for Injective, Osmosis, Cosmos Hub, Juno, etc.
  • Each chain deployment is a separate job/instance
  • All share the same nginx web UI

Recommended setup (Nomad):

Job: injective-snapshot → Uses injectived binary
Job: osmosis-snapshot   → Uses osmosisd binary
Job: nginx-snapshots    → Serves all chains via web UI

Example: Adding Multiple Chains

1. Deploy Injective:

nomad job run injective-snapshot.nomad

2. Deploy Osmosis:

nomad job run osmosis-snapshot.nomad

3. Deploy Nginx (shared):

nomad job run nginx-snapshots.nomad

Each chain writes to its own subdirectory automatically, and nginx serves them all from one beautiful landing page.

Directory Structure (Multi-Chain)

/shared/backups/
  ├── injective-1/
  │   ├── injective-1-12345-2024-01-15.tar.gz
  │   └── injective-1-12346-2024-01-16.tar.gz
  ├── osmosis-1/
  │   ├── osmosis-1-98765-2024-01-15.tar.gz
  │   └── osmosis-1-98766-2024-01-16.tar.gz
  └── cosmoshub-4/
      └── cosmoshub-4-45678-2024-01-15.tar.gz

The web UI at http://your-host/ automatically discovers and displays all available chains!

Deployment Scenarios

Scenario 1: Nomad (Recommended)

Use the included snapshot.nomad file:

  • Configs from Consul KV or flags
  • Automatic restart on failure
  • Service discovery
  • Health checks
  • Shared nginx for web UI

Troubleshooting

Service shows unhealthy

Check:

curl http://localhost:8080/status

Look for last_backup_error field.

Common causes:

  • Node won't sync (check RPC endpoint)
  • Disk space full (check backup directory)
  • cosmprund not installed
  • Permissions issues

No snapshots being created

Check logs:

# Nomad
nomad alloc logs <alloc-id>

# Docker
docker logs <container-id>

# Systemd
journalctl -u cosmos-snapshot -f

Verify:

  • Binary path correct (which injectived)
  • Home directory accessible
  • Backup directory writable

Snapshots created but Zabbix not updating

Verify Zabbix config:

# Test zabbix_sender manually
zabbix_sender -z <zabbix-server> -s <host> -k snapshot.injective-1.status -o 1

Check:

  • ZABBIX_SERVER and ZABBIX_HOST set correctly
  • zabbix_sender installed in container
  • Firewall allows connection to Zabbix server

Development

Build locally

go mod download
go build -o snapshot-service .

Run locally

Using flags (recommended):

./snapshot-service \
  --chain-id test \
  --binary echo \
  --home-dir /tmp/test \
  --snapshot-interval 5m \
  --log-format console \
  --log-level debug

Using environment variables:

export CHAIN_ID=test
export BINARY=echo  # Use echo for testing without real node
export HOME_DIR=/tmp/test
export SNAPSHOT_INTERVAL=5m
export LOG_FORMAT=console

./snapshot-service

Testing without real node

Set --binary echo (or BINARY=echo) and use /tmp/test for testing the service workflow without a real blockchain node.

Help Documentation

# View all available flags
./snapshot-service --help

# Test help during development
go run . --help

Migration from Bash Script

Before (Bash)

#!/bin/bash
while true; do
    injectived start &
    PID=$!
    # wait for sync...
    kill $PID
    cosmprund prune /data
    tar -czf backup.tar.gz /data
    sleep 86400
done

After (This Service)

Just deploy and configure. Same workflow, but with:

  • ✅ Monitoring
  • ✅ Proper error handling
  • ✅ Health checks
  • ✅ Metrics
  • ✅ Smart Zabbix alerting

License

MIT

Support

For issues or questions:

  • Check logs first
  • Review metrics/health endpoints
  • Verify configuration
  • Test with LOG_LEVEL=debug

About

A Go application for creating pruned Cosmos SDK node snapshots.
