141 changes: 141 additions & 0 deletions DISASTER_RECOVERY_RUNBOOK.md
@@ -0,0 +1,141 @@
# Nestera — Disaster Recovery Runbook

## Overview

| Item | Value |
|---|---|
| Backup schedule | Daily at 02:00 UTC |
| Retention | 30 days |
| Storage | AWS S3 (`nestera-db-backups`) with `STANDARD_IA` storage class |
| Encryption | AES-256-CBC (key stored in `BACKUP_ENCRYPTION_KEY`) + S3 SSE-AES256 |
| PITR | PostgreSQL WAL archiving enabled (see `docker-compose.yml`) |
| RTO target | < 2 hours |
| RPO target | < 24 hours (daily backup) / near-zero with WAL replay |

---

## 1. Assess the Incident

1. Check backup status: `GET /api/backup/status` (admin token required)
2. Check recent backup records: `GET /api/backup/records`
3. Identify the last successful backup and its `createdAt` timestamp
4. Determine target recovery point (latest backup vs. specific point-in-time)
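
Steps 3 and 4 can be scripted once the records have been fetched. The sketch below assumes `GET /api/backup/records` returns a JSON array whose entries carry an ISO-8601 `createdAt`, a `status` field equal to `'success'` for good runs, and a `filename`. Only `createdAt` is named in this runbook; `status`, its values, and `filename` are assumptions for illustration.

```javascript
// Hypothetical record shape: { filename, status, createdAt }.
// `status` and `filename` are assumed; only `createdAt` appears in the runbook.
function latestSuccessfulBackup(records, notAfter = new Date()) {
  const cutoff = notAfter.getTime();
  const candidates = records
    // keep only backups that completed successfully
    .filter((r) => r.status === 'success')
    // ignore backups taken after the desired recovery point
    .filter((r) => Date.parse(r.createdAt) <= cutoff)
    // newest first
    .sort((a, b) => Date.parse(b.createdAt) - Date.parse(a.createdAt));
  return candidates[0] ?? null;
}
```

Passing the target recovery time as `notAfter` yields the newest successful backup taken at or before that point; `null` means no usable backup exists.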

---

## 2. Point-in-Time Recovery (WAL-based)

Use this when you need to recover to a specific timestamp between backups.

```bash
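# 1. Stop the application and the database
docker compose stop api postgres

# 2. Restore a physical base backup into the data directory.
#    WAL can only be replayed on top of a file-level base backup (e.g. from
#    pg_basebackup) taken before the target time; a pg_dump archive restored
#    via Section 3 cannot be rolled forward with WAL.

# 3. Configure recovery. PostgreSQL 12+ no longer reads recovery.conf and refuses
#    to start if the file exists (wal_keep_size in docker-compose.yml implies 13+).
#    Put the settings in postgresql.auto.conf and create an empty recovery.signal.
#    Run this where the data directory is visible (inside the postgres container
#    or against the mounted postgres_data volume).
cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2026-03-29 14:30:00 UTC'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal

# 4. Start PostgreSQL; it replays WAL up to the target time, then promotes
docker compose start postgres

# 5. Monitor recovery progress
docker compose logs -f postgres | grep -E "recovery|redo"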
```

---

## 3. Full Backup Restore

### 3a. Download and decrypt the backup

```bash
# Download from S3
aws s3 cp s3://nestera-db-backups/backups/<filename>.dump.enc /tmp/restore.dump.enc \
  --region us-east-1

# Decrypt (requires BACKUP_ENCRYPTION_KEY as 64 hex chars)
node -e "
const fs = require('fs');
const crypto = require('crypto');
const key = Buffer.from(process.env.BACKUP_ENCRYPTION_KEY, 'hex');
const data = fs.readFileSync('/tmp/restore.dump.enc');
const iv = data.subarray(0, 16);
const enc = data.subarray(16);
const decipher = crypto.createDecipheriv('aes-256-cbc', key, iv);
fs.writeFileSync('/tmp/restore.dump', Buffer.concat([decipher.update(enc), decipher.final()]));
console.log('Decrypted successfully');
"
```
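
The decrypt step implies a file layout of a 16-byte IV followed by the AES-256-CBC ciphertext. As a sanity check of that layout, the sketch below round-trips a buffer in memory. `encryptBackup` is a hypothetical stand-in for whatever the backup service does on the write side, not its actual code.

```javascript
const crypto = require('crypto');

// Hypothetical write side: random 16-byte IV, then the ciphertext.
function encryptBackup(plain, key) {
  const iv = crypto.randomBytes(16);
  const cipher = crypto.createCipheriv('aes-256-cbc', key, iv);
  return Buffer.concat([iv, cipher.update(plain), cipher.final()]);
}

// Read side, mirroring the runbook's one-liner: split off the IV, decrypt the rest.
function decryptBackup(blob, key) {
  const iv = blob.subarray(0, 16);
  const decipher = crypto.createDecipheriv('aes-256-cbc', key, iv);
  return Buffer.concat([decipher.update(blob.subarray(16)), decipher.final()]);
}
```

CBC provides confidentiality but no integrity check, so a successful decrypt does not prove the file is intact. Running `pg_restore --list /tmp/restore.dump` afterwards is a cheap way to confirm the archive is readable.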

### 3b. Restore to PostgreSQL

```bash
# Stop the API so it cannot reconnect, then drop and recreate the target database
docker compose stop api
psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname='nestera' AND pid <> pg_backend_pid();"
psql "postgresql://nestera:nestera@localhost:5432/postgres" -c "DROP DATABASE IF EXISTS nestera;"
psql "postgresql://nestera:nestera@localhost:5432/postgres" -c "CREATE DATABASE nestera;"

# Restore
pg_restore --no-password -d "$DATABASE_URL" /tmp/restore.dump

# Verify
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';"
```

### 3c. Restart the application

```bash
docker compose up -d api
# -f makes curl exit non-zero on an HTTP error status
curl -fsS http://localhost:3001/health
```

---

## 4. Trigger On-Demand Backup

```bash
curl -X POST https://api.nestera.io/api/backup/trigger \
  -H "Authorization: Bearer <admin_token>"
```

---

## 5. Trigger On-Demand Restore Test

```bash
curl -X POST https://api.nestera.io/api/backup/restore-test \
  -H "Authorization: Bearer <admin_token>"
```

---

## 6. Monitoring & Alerts

- Backup monitor runs **every hour** — alerts ops@nestera.io if no successful backup in 26h
- Failed backup check runs at **02:30 UTC** daily
- Monthly restore test runs on the **first Sunday of each month at 04:00 UTC**
- All backup events are logged with size (MB) and duration (ms) metrics
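
The two time-based rules above (the 26-hour staleness alert and the first-Sunday restore test) reduce to small date checks. The sketch below is illustrative only: the function names are hypothetical, and the real monitor may be implemented differently, for example with cron expressions.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// True when no successful backup exists or the last one is older than maxAgeHours.
function isBackupStale(lastSuccessAt, now = new Date(), maxAgeHours = 26) {
  if (!lastSuccessAt) return true;
  const ageMs = now.getTime() - new Date(lastSuccessAt).getTime();
  return ageMs > maxAgeHours * HOUR_MS;
}

// True when the date (evaluated in UTC) is the first Sunday of its month:
// a Sunday whose day-of-month falls within 1..7.
function isFirstSundayUtc(date) {
  return date.getUTCDay() === 0 && date.getUTCDate() <= 7;
}
```

Evaluating both checks in UTC keeps them aligned with the 02:00, 02:30 and 04:00 UTC schedule regardless of the host's local time zone.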

---

## 7. Encryption Key Rotation

1. Generate a new key: `openssl rand -hex 32`
2. Update `BACKUP_ENCRYPTION_KEY` in your secrets manager / `.env`
3. Restart the API service
4. Old backups remain decryptable with the old key — store it securely until all old backups expire (30 days)
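
During the 30-day overlap, a restore may not know which key encrypted a given file. One possible approach, sketched below, is to try each candidate key in turn. Because AES-CBC has no authentication, a wrong key usually fails on padding but can occasionally yield garbage, so the sketch also checks for the `PGDMP` magic bytes that begin a `pg_dump` custom-format archive. That the backups use custom format is an assumption, and `decryptWithAnyKey` is a hypothetical helper, not part of the backup service.

```javascript
const crypto = require('crypto');

const HEX_KEY = /^[0-9a-fA-F]{64}$/;             // 32-byte key as 64 hex chars
const PG_CUSTOM_MAGIC = Buffer.from('PGDMP');    // assumed: custom-format dump

// Tries each hex key against an `iv || ciphertext` blob; returns the plaintext
// and the index of the key that worked, or throws if none fits.
function decryptWithAnyKey(blob, hexKeys) {
  const iv = blob.subarray(0, 16);
  const body = blob.subarray(16);
  for (const [keyIndex, hex] of hexKeys.entries()) {
    if (!HEX_KEY.test(hex)) throw new Error('key must be 64 hex characters');
    try {
      const d = crypto.createDecipheriv('aes-256-cbc', Buffer.from(hex, 'hex'), iv);
      const plain = Buffer.concat([d.update(body), d.final()]);
      if (plain.subarray(0, 5).equals(PG_CUSTOM_MAGIC)) return { plain, keyIndex };
    } catch (err) {
      // Wrong key: final() rejected the padding. Fall through to the next key.
    }
  }
  throw new Error('no supplied key decrypts this backup');
}
```

Listing the current key first and the retired key second keeps the common case fast while old backups age out.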

---

## 8. Contacts

| Role | Contact |
|---|---|
| On-call engineer | ops@nestera.io |
| AWS account owner | devops@nestera.io |
| Escalation | dev@nestera.io |
10 changes: 10 additions & 0 deletions backend/.env.example
@@ -54,3 +54,13 @@ HOSPITAL_RETRY_DELAY=1000
HOSPITAL_REQUEST_TIMEOUT=10000
HOSPITAL_CIRCUIT_BREAKER_THRESHOLD=5
HOSPITAL_CIRCUIT_BREAKER_TIMEOUT=60000

# ── Database Backup ───────────────────────────────────────────────────────────
BACKUP_S3_BUCKET=nestera-db-backups
BACKUP_S3_REGION=us-east-1
BACKUP_AWS_ACCESS_KEY_ID=your_aws_access_key_id
BACKUP_AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
# 64 hex characters = 32-byte AES-256 key. Generate with: openssl rand -hex 32
BACKUP_ENCRYPTION_KEY=your_64_char_hex_encryption_key_here_replace_this_value_now
BACKUP_RETENTION_DAYS=30
BACKUP_TMP_DIR=/tmp
12 changes: 12 additions & 0 deletions backend/docker-compose.yml
@@ -7,10 +7,20 @@ services:
      POSTGRES_USER: nestera
      POSTGRES_PASSWORD: nestera
      POSTGRES_DB: nestera
      # Enable WAL archiving for point-in-time recovery
      POSTGRES_INITDB_ARGS: "--wal-segsize=16"
    command: >
      postgres
      -c wal_level=replica
      -c archive_mode=on
      -c archive_command='cp %p /var/lib/postgresql/wal_archive/%f'
      -c max_wal_senders=3
      -c wal_keep_size=512
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - postgres_wal_archive:/var/lib/postgresql/wal_archive

  api:
    build: .
@@ -25,3 +35,5 @@ services:

volumes:
  postgres_data:
  postgres_wal_archive:
