# mega-obj-soaker

Self-optimising, actor-based, S3-compatible object storage downloader.

A high-performance Rust rewrite of super-obj-soaker, built on the rebar actor runtime. It is a drop-in replacement with identical CLI and environment-variable contracts.
## Features

- Actor-based concurrency — Uses rebar's GenServer and DynamicSupervisor (BEAM-inspired actor model) for lock-free coordination
- Self-optimising — Dynamically scales worker count based on measured throughput, stopping when gains plateau
- Resumable downloads — Skips files that match on size and modification time, re-downloads partial files
- S3-compatible — Works with any S3-compatible service via `--endpoint-url` (AWS S3, MinIO, SeaweedFS, etc.)
- Include/exclude patterns — Shell glob filtering matching AWS CLI behaviour
- Fault tolerant — DynamicSupervisor provides crash recovery for worker processes
- Configurable — All tuning parameters exposed via environment variables
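The resume check above can be sketched in Rust. This is a simplified illustration under assumed semantics — `should_skip`, its arguments, and the `>=` mtime comparison are invented for this sketch, not taken from the actual `s3.rs`:

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Decide whether a local file can be skipped: it must exist and match the
/// remote object's size, and be at least as new as the remote object.
/// A size mismatch (e.g. a partial download) forces a re-download.
fn should_skip(local: &Path, remote_size: u64, remote_mtime: SystemTime) -> bool {
    match fs::metadata(local) {
        Ok(meta) => {
            meta.len() == remote_size
                && meta.modified().map(|m| m >= remote_mtime).unwrap_or(false)
        }
        Err(_) => false, // file absent or unreadable: download it
    }
}
```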
## Architecture

```mermaid
graph TD
    RT["Rebar Runtime"]
    RT --> C["Coordinator<br/><i>GenServer</i>"]
    RT --> WS["WorkerSupervisor<br/><i>DynamicSupervisor</i>"]
    RT --> O["Optimizer<br/><i>Process</i>"]
    WS --> W0["Worker 0"]
    WS --> W1["Worker 1"]
    WS --> W2["Worker ..."]
    WS --> WN["Worker N"]
    W0 -- "Call: Task" --> C
    W0 -- "Cast: TaskComplete" --> C
    W1 -- "Call: Task" --> C
    W1 -- "Cast: TaskComplete" --> C
    O -- "Call: Stats" --> C
    O -- "Call: Progress" --> C
    O -- "start_child()" --> WS
    style RT fill:#2d3748,color:#fff
    style C fill:#3182ce,color:#fff
    style WS fill:#38a169,color:#fff
    style O fill:#d69e2e,color:#fff
    style W0 fill:#48bb78,color:#fff
    style W1 fill:#48bb78,color:#fff
    style W2 fill:#48bb78,color:#fff
    style WN fill:#48bb78,color:#fff
```
| Actor | Type | Responsibility |
|---|---|---|
| Coordinator | `GenServer` | Owns the task queue, tracks progress and byte counters, responds to Call/Cast messages |
| WorkerSupervisor | `DynamicSupervisor` | Manages worker lifecycle with crash recovery (`max_restarts=100` per 60s) |
| Worker | Process (temporary) | Requests tasks from the Coordinator, downloads objects, reports completion |
| Optimizer | Process | Samples throughput every `OPTIMIZATION_INTERVAL` seconds, scales workers when a >5% improvement is detected |
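The Call/Cast flow between a worker and the Coordinator can be mimicked with plain `std::sync::mpsc` channels. This is an analogy only, not rebar's actual API — `Msg`, `run_demo`, and the queue contents are invented for illustration:

```rust
use std::sync::mpsc;
use std::thread;

// Messages a worker sends to the coordinator: a Call carries a reply channel
// (the caller blocks on the answer); a Cast is fire-and-forget.
enum Msg {
    GetTask { reply: mpsc::Sender<Option<String>> },
    TaskComplete { key: String, bytes: u64 },
}

/// Run one coordinator and one worker to completion; return total bytes "downloaded".
fn run_demo() -> u64 {
    let (tx, rx) = mpsc::channel::<Msg>();

    // Coordinator: owns the task queue and the byte counter.
    let coordinator = thread::spawn(move || {
        let mut queue = vec!["a.bin".to_string(), "b.bin".to_string()];
        let mut total_bytes = 0u64;
        for msg in rx {
            match msg {
                Msg::GetTask { reply } => {
                    let _ = reply.send(queue.pop()); // None once drained
                }
                Msg::TaskComplete { key: _, bytes } => total_bytes += bytes,
            }
        }
        total_bytes
    });

    // Worker: Call for a task, pretend to download it, then Cast completion.
    let worker_tx = tx.clone();
    let worker = thread::spawn(move || loop {
        let (reply, answer) = mpsc::channel();
        worker_tx.send(Msg::GetTask { reply }).unwrap();
        match answer.recv().unwrap() {
            Some(key) => {
                let bytes = 10 * 1024 * 1024; // pretend 10 MB download
                worker_tx.send(Msg::TaskComplete { key, bytes }).unwrap();
            }
            None => break, // queue drained: worker exits
        }
    });

    worker.join().unwrap();
    drop(tx); // close the channel so the coordinator loop ends
    coordinator.join().unwrap()
}
```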
## How the optimizer works

- Every interval, the Optimizer queries the Coordinator for bytes downloaded since the last check
- It calculates the current speed in MB/s and appends it to a 60-sample rolling history
- If speed increased by more than 5% since the previous sample and is below `MAX_SPEED`, it spawns 5 additional workers (capped at `MAX_PROCESSES`)
- If speed plateaued or decreased, it holds steady — workers are never scaled down
- It exits when all tasks are complete
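The scaling rule above can be condensed into a single pure function. This is a sketch of the decision only — `next_worker_count` and its parameters are illustrative names, not the `optimizer.rs` API:

```rust
/// Decide the next worker count: grow by 5 only when throughput improved by
/// more than 5% and is still under the speed ceiling; never scale down.
fn next_worker_count(
    current_workers: usize,
    max_processes: usize,
    prev_mbps: f64,
    curr_mbps: f64,
    max_speed: f64,
) -> usize {
    let improved = curr_mbps > prev_mbps * 1.05; // >5% gain over last sample
    if improved && curr_mbps < max_speed {
        (current_workers + 5).min(max_processes) // capped at MAX_PROCESSES
    } else {
        current_workers // plateau or regression: hold steady
    }
}
```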
## Benchmarks

Tested against a local SeaweedFS instance — 100 x 10MB files (1GB total), `MAX_PROCESSES=16`, 3 runs each. Filesystem caches were dropped between runs.
| Tool | Best | Avg | Best MB/s | Avg MB/s |
|---|---|---|---|---|
| AWS S3 CLI (`aws s3 sync`) | 5.33s | 6.88s | 187.6 | 145.3 |
| super-obj-soaker (Python) | 5.33s | 6.21s | 187.6 | 161.0 |
| mega-obj-soaker (Rust) | 4.14s | 4.15s | 241.5 | 241.0 |
```mermaid
xychart-beta
    title "Download Throughput (MB/s) — Higher is Better"
    x-axis ["AWS S3 CLI", "super-obj-soaker", "mega-obj-soaker"]
    y-axis "MB/s" 0 --> 280
    bar [145.3, 161.0, 241.0]
```
mega-obj-soaker is 29% faster than super-obj-soaker and 22% faster than the AWS CLI at peak, with near-zero variance across runs (4.14s, 4.15s, 4.16s) compared to the high jitter seen in Python (5.33–7.93s) and the CLI (5.33–8.79s).
See `benchmark/` for the full, reproducible benchmark script and methodology.
## Installation

### Pre-built binary

Download the latest binary from Releases:

```sh
# Linux x86_64
curl -LO https://github.com/alexandernicholson/mega-obj-soaker/releases/latest/download/mega-obj-soaker-x86_64-unknown-linux-gnu.tar.gz
tar xzf mega-obj-soaker-x86_64-unknown-linux-gnu.tar.gz
sudo mv mega-obj-soaker /usr/local/bin/
```

### From source

Requires Rust 1.85+ (edition 2024) and a local clone of rebar as a sibling directory.
```sh
git clone https://github.com/alexandernicholson/rebar.git
git clone https://github.com/alexandernicholson/mega-obj-soaker.git
cd mega-obj-soaker
cargo build --release
cp target/release/mega-obj-soaker /usr/local/bin/
```

### Docker

```sh
docker-compose up --build
```

## Usage

```sh
mega-obj-soaker <SOURCE> <DESTINATION> [OPTIONS]
```
| Argument | Description |
|---|---|
| `SOURCE` | S3 URI (e.g. `s3://bucket/prefix`) |
| `DESTINATION` | Local filesystem path |
| Flag | Default | Description |
|---|---|---|
| `--region <REGION>` | `us-east-1` | AWS region |
| `--log-level <LEVEL>` | `INFO` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `--endpoint-url <URL>` | — | Custom S3 endpoint for S3-compatible services |
| `--include <PATTERN>` | — | Glob pattern to include (repeatable) |
| `--exclude <PATTERN>` | — | Glob pattern to exclude (repeatable) |
### Examples

Basic download:

```sh
mega-obj-soaker s3://mybucket/data /local/path
```

Custom region and endpoint (MinIO, SeaweedFS, etc.):

```sh
mega-obj-soaker s3://mybucket/data /local/path \
  --region us-west-2 \
  --endpoint-url http://localhost:8333
```

Selective download with patterns:

```sh
mega-obj-soaker s3://mybucket/data /local/path \
  --exclude "*" --include "*.parquet"
```

Tuned for maximum throughput:

```sh
MAX_PROCESSES=64 OPTIMIZATION_INTERVAL=5 \
  mega-obj-soaker s3://mybucket/data /local/path
```

## Configuration

All tuning is done via environment variables:
| Variable | Type | Default | Description |
|---|---|---|---|
| `MIN_PROCESSES` | int | `1` | Minimum concurrent download workers |
| `MAX_PROCESSES` | int | `16` | Maximum concurrent download workers |
| `MAX_SPEED` | float | ~unlimited | Speed ceiling in MB/s — stops scaling when reached |
| `OPTIMIZATION_INTERVAL` | float | `10.0` | Seconds between throughput sampling |
| `MAX_RETRIES` | int | `3` | Retry attempts per failed download |
| `RETRY_DELAY` | float | `5.0` | Seconds between retries |
| `S3_VERIFY_SSL` | bool | `true` | Verify SSL certificates |
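A common way to read such variables with typed defaults looks like this — a sketch of the pattern only; `env_or` is an invented helper, not the actual `config.rs` API:

```rust
use std::env;
use std::str::FromStr;

/// Read an environment variable, falling back to `default` when it is
/// unset or fails to parse.
fn env_or<T: FromStr>(name: &str, default: T) -> T {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

// Example usage with the defaults from the table above:
// let max_processes: usize = env_or("MAX_PROCESSES", 16);
// let retry_delay: f64 = env_or("RETRY_DELAY", 5.0);
```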
## Project structure

```text
mega-obj-soaker/
├── Cargo.toml
├── Dockerfile
├── docker-compose.yaml
├── run_tests.sh
├── benchmark/
│   ├── README.md        # Benchmark methodology and options
│   └── run.sh           # Reproducible benchmark script
└── src/
    ├── main.rs          # CLI entry point, rebar runtime bootstrap
    ├── config.rs        # Environment variable parsing
    ├── s3.rs            # S3 client, listing, download with resume/retry
    ├── pattern.rs       # Glob include/exclude filtering
    ├── coordinator.rs   # GenServer: task queue + stats + progress
    ├── worker.rs        # Worker download loop
    ├── optimizer.rs     # Throughput monitoring and worker scaling
    └── supervisor.rs    # Supervision tree wiring
```
## Tests

```sh
# Unit tests
cargo test

# Integration tests with SeaweedFS
./run_tests.sh
```