
Pensieve

Archive, explore, and analyze Nostr data at network scale.

Overview

Pensieve is an archive-first Nostr indexer. It stores canonical events in a local notepack archive (source of truth), with ClickHouse as a derived analytics index.

┌─────────────────────────────────────────────────────────────────┐
│                        Event Sources                            │
│  ┌───────────┐    ┌───────────┐    ┌────────────────────────┐  │
│  │   JSONL   │    │ Protobuf  │    │     Live Relays        │  │
│  │   Files   │    │  Archives │    │ (WebSocket + NIP-65)   │  │
│  └─────┬─────┘    └─────┬─────┘    └───────────┬────────────┘  │
└────────┼────────────────┼──────────────────────┼───────────────┘
         │                │                      │
         └────────────────┴──────────────────────┘
                          │
                          ▼
                  ┌───────────────┐
                  │  DedupeIndex  │  RocksDB - tracks seen event IDs
                  └───────┬───────┘
                          │
                          ▼
                  ┌───────────────┐
                  │ SegmentWriter │  Notepack segments (gzipped)
                  └───────┬───────┘
                          │
              ┌───────────┴───────────┐
              │                       │
              ▼                       ▼
      ┌───────────────┐       ┌───────────────┐
      │ Storage Box   │       │  ClickHouse   │
      │   (rclone)    │       │    Index      │
      └───────────────┘       └───────────────┘

Current Status

Component Status
pensieve-core ✅ Event validation, notepack encoding, metrics
pensieve-ingest ✅ Live relay ingestion + backfill binaries + relay quality tracking
pensieve-serve 🚧 Placeholder
Deployment ✅ Docker + systemd setup in pensieve-deploy/

Ingestion Modes

Pensieve supports three event sources, all feeding into the same pipeline:

1. Live Relay Ingestion (Real-time)

Connect to Nostr relays via WebSocket and stream events in real time. Includes automatic relay discovery via NIP-65.

# Basic usage (connects to default seed relays)
./target/release/pensieve-ingest \
  -o ./segments \
  --rocksdb-path ./data/dedupe

# Full options
./target/release/pensieve-ingest \
  -o ./segments \
  --rocksdb-path ./data/dedupe \
  --clickhouse-url http://localhost:8123 \
  --seed-relays "wss://relay.damus.io,wss://nos.lol,wss://relay.primal.net" \
  --max-relays 50 \
  --metrics-port 9091

Features:

  • Generates an ephemeral keypair for NIP-42 authentication
  • Auto-discovers relays via NIP-65 (kind:10002 events; see the sample event after this list)
  • Disconnects from paid/whitelist relays that reject auth
  • Relay quality tracking (SQLite) with scoring and optimization
  • Graceful shutdown on Ctrl+C
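
Relay discovery reads kind:10002 relay lists, whose r tags carry relay URLs with optional read/write markers (per NIP-65). A trimmed example, with id, pubkey, and sig omitted:

{
  "kind": 10002,
  "created_at": 1700000000,
  "tags": [
    ["r", "wss://relay.example.com"],
    ["r", "wss://inbox.example.com", "read"],
    ["r", "wss://outbox.example.com", "write"]
  ],
  "content": ""
}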

Options:

Flag Default Description
-o, --output ./segments Output directory for notepack segments
--rocksdb-path ./data/dedupe RocksDB path for deduplication
--clickhouse-url (none) ClickHouse URL for indexing
--clickhouse-db nostr ClickHouse database name
--seed-relays 8 public relays Comma-separated relay URLs
--max-relays 30 Maximum concurrent relay connections
--no-discovery false Disable NIP-65 relay discovery
--segment-size 268435456 Max segment size before sealing (256 MB)
--no-compress false Disable gzip compression
--metrics-port 9091 Prometheus metrics port (0 to disable)
--relay-db-path ./data/relay-stats.db SQLite path for relay quality tracking
--import-relays-csv (none) Import relays from georelays CSV
--score-interval-secs 300 Score recomputation interval
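
The discovery flags compose. For example, to pin ingestion to a fixed, curated relay set (all flags as documented above):

# Fixed relay set: seed relays only, no NIP-65 discovery
./target/release/pensieve-ingest \
  -o ./segments \
  --rocksdb-path ./data/dedupe \
  --seed-relays "wss://relay.damus.io,wss://nos.lol" \
  --no-discovery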

2. JSONL Backfill (Batch)

Import events from JSONL files (one JSON event per line). Useful for importing strfry dumps or other JSON exports.
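
Each line is a standard NIP-01 event object. An illustrative line (id, pubkey, and sig truncated):

{"id":"a1b2…","pubkey":"c3d4…","created_at":1700000000,"kind":1,"tags":[],"content":"hello nostr","sig":"e5f6…"}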

# Single file
./target/release/backfill-jsonl \
  -i events.jsonl \
  -o ./segments

# Directory of files
./target/release/backfill-jsonl \
  -i ./jsonl-data/ \
  -o ./segments \
  --rocksdb-path ./data/dedupe \
  --clickhouse-url http://localhost:8123

# Fast mode (skip validation, trust input)
./target/release/backfill-jsonl \
  -i ./jsonl-data/ \
  -o ./segments \
  --skip-validation

Options:

Flag Default Description
-i, --input (required) Input JSONL file or directory
-o, --output (required) Output directory for segments
--rocksdb-path (none) RocksDB path for deduplication
--clickhouse-url (none) ClickHouse URL for indexing
--skip-validation false Skip ID/signature verification
--limit (none) Limit number of files to process
--progress-interval 100000 Log progress every N events
--metrics-port 9091 Prometheus metrics port

3. Protobuf Backfill (Batch)

Import events from length-delimited protobuf files. Supports local files or S3 with resumable progress.

# Local files
./target/release/backfill-proto \
  -i ./proto-segments/ \
  -o ./segments \
  --rocksdb-path ./data/dedupe

# S3 source with resume support
./target/release/backfill-proto \
  --s3-bucket your-bucket \
  --s3-prefix nostr/segments/ \
  -o ./segments \
  --rocksdb-path ./data/dedupe \
  --clickhouse-url http://localhost:8123

# Limit for testing
./target/release/backfill-proto \
  --s3-bucket your-bucket \
  --s3-prefix nostr/segments/ \
  -o ./segments \
  --limit 5
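
Assuming progress is keyed to the progress file (auto-generated by default), an interrupted S3 run can be restarted with the same command and should skip already-processed objects; the explicit path below is illustrative:

# Restart after an interruption; progress is read from the progress file
./target/release/backfill-proto \
  --s3-bucket your-bucket \
  --s3-prefix nostr/segments/ \
  -o ./segments \
  --progress-file ./data/backfill-progress.json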

Options:

Flag Default Description
-i, --input (local) Input protobuf file or directory
--s3-bucket (S3) S3 bucket name
--s3-prefix (S3) S3 key prefix
-o, --output (required) Output directory for segments
--rocksdb-path (none) RocksDB path for deduplication
--clickhouse-url (none) ClickHouse URL for indexing
--gzip auto Force gzip decompression (otherwise auto-detected)
--skip-validation false Skip ID/signature verification
--progress-file auto Path for S3 resume progress
--temp-dir system Temp directory for S3 downloads
--metrics-port 9091 Prometheus metrics port

Quick Start

# Build all binaries
cargo build --release

# Start live ingestion (simplest form)
./target/release/pensieve-ingest -o ./segments --rocksdb-path ./data/dedupe

# Or run a backfill from local JSONL files
./target/release/backfill-jsonl -i ./events.jsonl -o ./segments
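
Both binaries expose Prometheus metrics over HTTP; assuming the default port and the conventional /metrics path, a quick sanity check:

# Spot-check ingestion via the Prometheus endpoint (default port 9091)
curl -s http://localhost:9091/metrics | head -n 20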

Pipeline Components

All ingestion modes use the same core pipeline:

  • DedupeIndex (RocksDB) — Tracks seen event IDs to prevent duplicates
  • SegmentWriter — Writes events to gzipped notepack segments, seals at size threshold
  • ClickHouseIndexer — Indexes sealed segments into ClickHouse for analytics (see the example query after this list)
  • RelayManager (SQLite) — Tracks per-relay quality metrics and optimizes connections
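
As a sketch of what the derived index enables, a query like the following would summarize archived events by kind (the events table and its columns are assumptions; check the actual schema):

# Hypothetical analytics query against the ClickHouse index
clickhouse-client --database nostr --query "
  SELECT kind, count() AS n
  FROM events
  GROUP BY kind
  ORDER BY n DESC
  LIMIT 10"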

Relay Quality Tracking

The ingestion daemon tracks per-relay quality metrics to prioritize high-value relays:

Metrics tracked:

  • Novel event rate — Events/hour that passed deduplication (not seen before)
  • Uptime — Connection success rate over time
  • Connection history — Attempts, successes, failures

Scoring:

score = (novel_rate_normalized × 0.7) + (uptime × 0.3)

Relays are ranked by score. Seed relays (manually curated) get a floor score of 0.5 to prevent eviction.
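
For example, a relay with a normalized novel rate of 0.6 and 90% uptime scores (0.6 × 0.7) + (0.9 × 0.3) = 0.42 + 0.27 = 0.69.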

Slot optimization:

  • Every 5 minutes (configurable; see the example after this list), scores are recomputed
  • Up to 5% of relay slots can be swapped per cycle
  • Low-scoring discovered relays are replaced with higher-scoring unconnected ones
  • Relays with 10+ consecutive failures are blocked
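
Both knobs are exposed on the CLI, for example:

# Recompute scores every 10 minutes across up to 50 relay slots
./target/release/pensieve-ingest \
  -o ./segments \
  --max-relays 50 \
  --score-interval-secs 600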

Bootstrap from georelays:

You can import relay lists from the georelays project:

# Download the georelays CSV
curl -o data/relays/georelays.csv \
  https://raw.githubusercontent.com/permissionlesstech/georelays/main/nostr_relays.csv

# Run with import
./target/release/pensieve-ingest \
  --import-relays-csv ./data/relays/georelays.csv \
  -o ./segments

Query relay stats:

sqlite3 ./data/relay-stats.db "
  SELECT url, score, novel_rate_7d, uptime_7d
  FROM relays r
  JOIN relay_scores s ON r.url = s.relay_url
  ORDER BY score DESC
  LIMIT 20;
"

Deployment

See pensieve-deploy/README.md for production setup with Docker, ClickHouse, Prometheus, and Grafana.

Documentation

License

See LICENSE.
