dpdanpittman/cosmos-node-snapshots

Cosmos Snapshot Service

Automated snapshot creation service for Cosmos SDK chains, with built-in monitoring and a beautiful web UI.

Overview

This service automates the process of creating chain snapshots for Cosmos SDK nodes:

  1. Starts a backup node
  2. Waits for it to sync
  3. Stops the node
  4. Prunes the database (optional)
  5. Creates a compressed snapshot
  6. Cleans up old snapshots
  7. Repeats on configured interval
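The loop above can be sketched in Go. The step names and function signatures here are illustrative, not the service's actual API:

```go
package main

import (
	"log"
	"time"
)

// snapshotSteps returns the ordered steps of one backup cycle,
// matching the workflow above; pruning is optional.
func snapshotSteps(prune bool) []string {
	steps := []string{"start-node", "wait-for-sync", "stop-node"}
	if prune {
		steps = append(steps, "prune")
	}
	return append(steps, "create-snapshot", "cleanup-old")
}

// runLoop executes the cycle forever, sleeping the configured
// interval between cycles; run is a placeholder for step execution.
func runLoop(interval time.Duration, prune bool, run func(step string) error) {
	for {
		for _, s := range snapshotSteps(prune) {
			if err := run(s); err != nil {
				log.Printf("step %q failed: %v", s, err)
				break // record the failure; retry next cycle
			}
		}
		time.Sleep(interval)
	}
}
```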

Key Features:

  • βœ… Multi-chain support - One Docker image for all chains (Injective, Osmosis, Cosmos Hub, Juno, etc.)
  • βœ… Beautiful web UI - Modern landing page to browse and download snapshots
  • βœ… Full monitoring via Prometheus metrics
  • βœ… Health endpoints for service checks
  • βœ… Zabbix integration with smart alerting
  • βœ… Flag-based config - Command-line flags with env var fallback
  • βœ… Auto-organized backups - Snapshots automatically sectioned by chain
  • βœ… HashiCorp Nomad ready with Consul integration
  • βœ… Structured logging (JSON or console)

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    One Docker Image                         │
│  ┌──────────┬──────────┬──────────┬──────────┬──────────┐   │
│  │injectived│ osmosisd │  gaiad   │  junod   │cosmprund │   │
│  └──────────┴──────────┴──────────┴──────────┴──────────┘   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │         Snapshot Service (Go Binary)                │    │
│  │  • Start node → Sync → Stop → Prune → Backup        │    │
│  │  • HTTP Health & Metrics                            │    │
│  │  • Auto-organize by chain                           │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────────┐
         │   Shared Volume: /backups/           │
         │                                      │
         │   ├── injective-1/                   │
         │   │   └── snapshots.tar.gz           │
         │   ├── osmosis-1/                     │
         │   │   └── snapshots.tar.gz           │
         │   └── cosmoshub-4/                   │
         │       └── snapshots.tar.gz           │
         └──────────────────────────────────────┘
                           │
                           ▼
         ┌──────────────────────────────────────┐
         │    Nginx (Beautiful Web UI)          │
         │                                      │
         │  🚀 Cosmos Snapshot Service          │
         │  ┌──────┐ ┌──────┐ ┌──────┐          │
         │  │ INJ  │ │ OSMO │ │ ATOM │          │
         │  └──────┘ └──────┘ └──────┘          │
         └──────────────────────────────────────┘

Deployment Model:

  • Per Chain: One job/container instance per blockchain
  • Shared: Single nginx serves all chains via one web interface
  • Scalable: Add new chains by deploying new instances (same image!)

Why This Over Bash Script?

Problems with bash script:

  • ❌ No way to monitor service health
  • ❌ Can't distinguish "creating snapshot" from "crashed"
  • ❌ Zabbix templates alert when node stops (which is expected!)
  • ❌ All errors silenced (|| true everywhere)
  • ❌ No metrics on snapshot size, duration, success rate

This service solves:

  • ✅ Dedicated health endpoint - know if the service is running
  • ✅ Separate metrics for node vs snapshot service
  • ✅ Zabbix template that understands the backup workflow
  • ✅ Proper error handling and reporting
  • ✅ Detailed metrics and observability

Quick Start

1. Build Docker Image

The Docker image includes binaries for multiple chains:

docker build -t cosmos-snapshot-service:latest .

Included chains:

  • Injective (injectived)
  • Osmosis (osmosisd)
  • Cosmos Hub (gaiad)
  • Juno (junod)

2. Run the Service

Using flags (recommended):

snapshot-service \
  --chain-id injective-1 \
  --binary injectived \
  --home-dir /data/.injectived

Using environment variables:

export CHAIN_ID=injective-1
export BINARY=injectived
export HOME_DIR=/data/.injectived
snapshot-service

Using Docker:

docker run -v /data:/data \
  cosmos-snapshot-service:latest \
  --chain-id injective-1 \
  --binary injectived \
  --home-dir /data/.injectived

3. Deploy with Nomad

Use the included snapshot.nomad job file:

nomad job run snapshot.nomad

The Nomad job includes:

  • snapshot-service task - Creates snapshots automatically
  • nginx task - Serves snapshots via beautiful web UI

4. Access the Web UI

Navigate to http://your-host/ to see the snapshot landing page with all available chains.

5. Import Zabbix Template

  1. Open Zabbix UI → Data collection → Templates
  2. Click Import
  3. Upload zabbix_template_snapshot.yaml
  4. Create host with this template
  5. Set macros: {HOST.IP}, {$HEALTH_PORT}, {$METRICS_PORT}

6. Verify

# Check health
curl http://localhost:8080/health

# Check status
curl http://localhost:8080/status

# Check metrics
curl http://localhost:9090/metrics

# Browse snapshots (if nginx is running)
curl http://localhost/

Configuration

Configuration via command-line flags (preferred) or environment variables (fallback).

Priority: Flags > Environment Variables > Defaults

Command-Line Flags

Run snapshot-service --help for full documentation.

Required:

--chain-id string        Chain identifier (e.g., injective-1)
--binary string          Chain binary name (e.g., injectived)
--home-dir string        Chain home directory

Snapshot Configuration:

--snapshot-interval duration   Snapshot interval (default: 24h)
--backup-dir string           Backup directory (default: {HOME_DIR}/backups/{CHAIN_ID})
--retention-count int         Number of snapshots to keep (default: 2)
--prune                       Run cosmprund before backup (default: true)

Node Configuration:

--rpc-endpoint string    RPC endpoint (default: http://localhost:26657)
--sync-timeout duration  Max sync wait time (default: 24h)
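A sketch of the sync wait these two flags control, assuming the standard Tendermint `/status` RPC shape (`result.sync_info.catching_up`); `waitForSync` and the 30-second poll interval are illustrative choices, not the service's actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// statusResponse models the subset of the Tendermint /status
// reply needed to detect sync completion.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			CatchingUp bool `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

// caughtUp reports whether a /status body says the node has
// finished catching up.
func caughtUp(body []byte) (bool, error) {
	var s statusResponse
	if err := json.Unmarshal(body, &s); err != nil {
		return false, err
	}
	return !s.Result.SyncInfo.CatchingUp, nil
}

// waitForSync polls the RPC endpoint until the node reports it is
// no longer catching up, or the timeout elapses.
func waitForSync(rpc string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(rpc + "/status")
		if err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if ok, err := caughtUp(body); err == nil && ok {
				return nil
			}
		}
		time.Sleep(30 * time.Second)
	}
	return fmt.Errorf("node did not sync within %s", timeout)
}
```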

Monitoring:

--health-port string     Health port (default: 8080)
--metrics-port string    Metrics port (default: 9090)
--zabbix-server string   Zabbix server address (optional)
--zabbix-host string     Zabbix hostname (optional)

Logging:

--log-level string    Level: debug, info, warn, error (default: info)
--log-format string   Format: json, console (default: json)

Environment Variables

All flags map to environment variables (uppercase with underscores):

Flag                  Environment Variable   Example
--chain-id            CHAIN_ID               injective-1
--binary              BINARY                 injectived
--home-dir            HOME_DIR               /data/.injectived
--snapshot-interval   SNAPSHOT_INTERVAL      24h
--backup-dir          BACKUP_DIR             /backups/injective-1
--retention-count     RETENTION_COUNT        3
--prune               PRUNE_BEFORE_BACKUP    true
--rpc-endpoint        RPC_ENDPOINT           http://localhost:26657
--sync-timeout        SYNC_TIMEOUT           24h
--health-port         HEALTH_PORT            8080
--metrics-port        METRICS_PORT           9090
--log-level           LOG_LEVEL              info
--log-format          LOG_FORMAT             json

Backup Directory Structure

Snapshots are automatically organized by chain:

/data/.injectived/
  └── backups/
      └── injective-1/
          ├── injective-1-12345-2024-01-15T10-00-00.tar.gz
          └── injective-1-12346-2024-01-16T10-00-00.tar.gz

/data/.osmosisd/
  └── backups/
      └── osmosis-1/
          └── osmosis-1-98765-2024-01-15T10-00-00.tar.gz

This allows a single nginx instance to serve all chains from one directory structure.

Monitoring

Health Endpoints

GET /health - Simple health check

{
  "status": "healthy"
}

GET /status - Detailed status

{
  "status": "healthy",
  "chain_id": "injective-1",
  "last_backup_time": "2024-01-15T10:30:00Z",
  "node_running": false,
  "creating_backup": false,
  "next_backup_in": "20h15m30s",
  "snapshot_interval": "24h"
}

Prometheus Metrics

Available at http://localhost:9090/metrics:

Metric                            Type       Description
snapshot_success_total            Counter    Total successful snapshots
snapshot_error_total              Counter    Total failed snapshots
snapshot_last_success_timestamp   Gauge      Unix timestamp of last success
snapshot_duration_seconds         Histogram  Time to create snapshot
snapshot_size_megabytes           Gauge      Size of last snapshot (MB)
snapshot_service_health_status    Gauge      Service health (1=healthy, 0=unhealthy)
snapshot_node_running             Gauge      Node currently running (1=yes, 0=no)
snapshot_backup_creating          Gauge      Backup being created (1=yes, 0=no)

Zabbix Monitoring

The included Zabbix template monitors:

✅ What it monitors:

  • Snapshot creation success/failure
  • Time since last successful snapshot
  • Service health (separate from node health!)
  • Snapshot size trends

🎯 Smart alerting:

  • ✅ Alerts if no snapshot in 24-48 hours
  • ✅ Alerts if snapshot creation fails
  • ✅ Alerts if service becomes unhealthy
  • ❌ Does NOT alert when node stops (expected behavior!)

Key difference from normal node monitoring:

  • Normal template: Alerts when node stops ❌
  • This template: Understands node stops for backups ✅

How It Works

Normal Operation

[Service Running] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━> Time
    │
    ├─> Start Node ━━━━━━━━━━━━━━━━━━━━━> Node Running
    │
    ├─> Wait for Sync (polling /status)
    │
    ├─> Stop Node ━━━━━━━━━━━━━━━━━━━━━> Node Stopped (EXPECTED!)
    │
    ├─> Prune Database (cosmprund)
    │
    ├─> Create Backup (tar.gz) ━━━━━━━> Backup Creating
    │
    ├─> Cleanup Old Backups
    │
    ├─> Update Metrics
    │
    └─> Wait {SNAPSHOT_INTERVAL} ━━━━━> Repeat

Health States

State       Node Running   Creating Backup   Health
Syncing     ✅ Yes         ❌ No             ✅ Healthy
Backing up  ❌ No          ✅ Yes            ✅ Healthy
Waiting     ❌ No          ❌ No             ✅ Healthy
Failed      ❌ No          ❌ No             ❌ Unhealthy

Service is unhealthy if:

  • Last snapshot failed
  • No snapshot in 2x interval (e.g., 48h if interval is 24h)

Service is healthy even when:

  • Node is stopped (normal during backup)
  • Waiting between snapshots

Comparison: Bash vs Go Service

Aspect              Bash Script             This Service
Monitoring          Zabbix sender only      Full Prometheus + Zabbix
Health Check        None                    HTTP endpoints
Error Handling      Silenced (|| true)      Proper errors + metrics
Status Visibility   None                    Real-time status endpoint
Alerting            Node stop = alert ❌    Smart alerting ✅
Metrics             None                    Size, duration, success rate
Logs                Plain text              Structured JSON
Service Health      Can't tell if crashed   Monitored continuously

Multi-Chain Deployment

Architecture

One Docker image supports all chains:

  • The Docker image contains binaries for Injective, Osmosis, Cosmos Hub, Juno, etc.
  • Each chain deployment is a separate job/instance
  • All share the same nginx web UI

Recommended setup (Nomad):

Job: injective-snapshot → Uses injectived binary
Job: osmosis-snapshot   → Uses osmosisd binary
Job: nginx-snapshots    → Serves all chains via web UI

Example: Adding Multiple Chains

1. Deploy Injective:

nomad job run injective-snapshot.nomad

2. Deploy Osmosis:

nomad job run osmosis-snapshot.nomad

3. Deploy Nginx (shared):

nomad job run nginx-snapshots.nomad

Each chain writes to its own subdirectory automatically, and nginx serves them all from one beautiful landing page.

Directory Structure (Multi-Chain)

/shared/backups/
  ├── injective-1/
  │   ├── injective-1-12345-2024-01-15.tar.gz
  │   └── injective-1-12346-2024-01-16.tar.gz
  ├── osmosis-1/
  │   ├── osmosis-1-98765-2024-01-15.tar.gz
  │   └── osmosis-1-98766-2024-01-16.tar.gz
  └── cosmoshub-4/
      └── cosmoshub-4-45678-2024-01-15.tar.gz

The web UI at http://your-host/ automatically discovers and displays all available chains!

Deployment Scenarios

Scenario 1: Nomad (Recommended)

Use the included snapshot.nomad file:

  • Configs from Consul KV or flags
  • Automatic restart on failure
  • Service discovery
  • Health checks
  • Shared nginx for web UI

Troubleshooting

Service shows unhealthy

Check:

curl http://localhost:8080/status

Look for last_backup_error field.

Common causes:

  • Node won't sync (check RPC endpoint)
  • Disk space full (check backup directory)
  • cosmprund not installed
  • Permissions issues

No snapshots being created

Check logs:

# Nomad
nomad alloc logs <alloc-id>

# Docker
docker logs <container-id>

# Systemd
journalctl -u cosmos-snapshot -f

Verify:

  • Binary path correct (which injectived)
  • Home directory accessible
  • Backup directory writable

Snapshots created but Zabbix not updating

Verify Zabbix config:

# Test zabbix_sender manually
zabbix_sender -z <zabbix-server> -s <host> -k snapshot.injective-1.status -o 1

Check:

  • ZABBIX_SERVER and ZABBIX_HOST set correctly
  • zabbix_sender installed in container
  • Firewall allows connection to Zabbix server

Development

Build locally

go mod download
go build -o snapshot-service .

Run locally

Using flags (recommended):

./snapshot-service \
  --chain-id test \
  --binary echo \
  --home-dir /tmp/test \
  --snapshot-interval 5m \
  --log-format console \
  --log-level debug

Using environment variables:

export CHAIN_ID=test
export BINARY=echo  # Use echo for testing without real node
export HOME_DIR=/tmp/test
export SNAPSHOT_INTERVAL=5m
export LOG_FORMAT=console

./snapshot-service

Testing without real node

Set --binary echo (or BINARY=echo) and use /tmp/test for testing the service workflow without a real blockchain node.

Help Documentation

# View all available flags
./snapshot-service --help

# Test help during development
go run . --help

Migration from Bash Script

Before (Bash)

#!/bin/bash
while true; do
    injectived start &
    PID=$!
    # wait for sync...
    kill $PID
    cosmprund prune /data
    tar -czf backup.tar.gz /data
    sleep 86400
done

After (This Service)

Just deploy and configure. Same workflow, but with:

  • ✅ Monitoring
  • ✅ Proper error handling
  • ✅ Health checks
  • ✅ Metrics
  • ✅ Smart Zabbix alerting

License

MIT

Support

For issues or questions:

  • Check logs first
  • Review metrics/health endpoints
  • Verify configuration
  • Test with LOG_LEVEL=debug

About

A Go application for creating pruned Cosmos SDK node snapshots.
