Toy/experimental clone of Apache Cassandra written in Rust, mostly using OpenAI Codex.
- Features
- Design tradeoffs
- Query Syntax
- Development
- Storage Backends
- Example / Docker Compose Cluster
- Consistency, Hinted Handoff, and Read Repair
- Maintenance Commands
- Monitoring
- Distributed Tracing
- Performance Benchmarking
- Flamegraph Profiling
- gRPC API and CLI for submitting SQL queries
- Data Structure: Stores data in a log-structured merge tree
- Storage: Column-oriented SSTable placeholders with bloom filters and zone maps to speed up queries; persists to local disk or AWS S3 backends
- Durability / Recovery: Sharded write-ahead logs for durability and in-memory tables for parallel ingestion
- Deployment: Dockerfile and docker-compose for containerized deployment and local testing
- Scalability: Horizontally scalable
- Gossip: Cluster membership and liveness detection via gossip with health checks
- Consistency: Tunable read replica count with hinted handoff and read repair
- Lightweight Transactions: compare-and-set operations
Like Cassandra itself, cass is an AP system:
- Consistency: relaxed; conflicts are resolved with last-write-wins
- Availability: always writable, tunably consistent, fault-tolerant through replication
- Partition tolerance: will continue to work even if parts of the cluster cannot communicate
The built-in SQL engine understands a small subset of SQL:
- `INSERT` of a key/value pair into a table
- `UPDATE` and `DELETE` statements targeting a single key
- `SELECT` with optional `WHERE` filters, `ORDER BY`, `GROUP BY`, `DISTINCT`, simple aggregate functions (`COUNT`, `MIN`, `MAX`, `SUM`), and `LIMIT`
- Table management statements such as `CREATE TABLE`, `DROP TABLE`, `TRUNCATE TABLE`, and `SHOW TABLES`
- Lightweight transactions (compare-and-set):
  - `INSERT ... IF NOT EXISTS`
  - `UPDATE ... IF col = value` (simple equality predicates)
  - Returns a single row with `[applied]` and, on failure, the current values for the checked columns
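As a rough sketch of what the supported subset looks like in practice (using a hypothetical `users` table; exact literal and type handling may differ):

```sql
CREATE TABLE users (id int, name text, score int, PRIMARY KEY (id));
INSERT INTO users VALUES (1, 'ada', 10);
UPDATE users SET score = 11 WHERE id = 1;
SELECT * FROM users WHERE id = 1;
SELECT COUNT(1) FROM users;
DELETE FROM users WHERE id = 1;
DROP TABLE users;
```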
Note on partition and clustering keys:
the first column in PRIMARY KEY(...) becomes the partition key, and any subsequent columns are indexed as clustering keys.
In the example below, id is the partition key and c is a clustering key:
```sql
CREATE TABLE t (
    id int,
    c text,
    k int,
    v text,
    PRIMARY KEY (id, c)
);
```
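With that schema, restricting only the partition key returns every clustering row in the partition, while adding the clustering key narrows the result to a single row; for example:

```sql
-- all rows in the partition id = 1
SELECT * FROM t WHERE id = 1;

-- a single row identified by partition key + clustering key
SELECT * FROM t WHERE id = 1 AND c = 'a';
```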
Cass supports Cassandra-style lightweight transactions for conditional writes using a Paxos-like protocol across the partition's replicas. Two forms are supported:
- `INSERT ... IF NOT EXISTS` inserts only when the row does not exist.
- `UPDATE ... IF col = value [AND col2 = value2 ...]` applies the update only if all equality predicates match the current row.
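For example, against a hypothetical `users` table:

```sql
-- succeeds only if no row with id = 1 exists yet
INSERT INTO users VALUES (1, 'ada') IF NOT EXISTS;

-- applies only if the current value of name is still 'ada'
UPDATE users SET name = 'grace' WHERE id = 1 IF name = 'ada';
```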
Response shape mirrors Cassandra:
- On success: a single row with `[applied] = true`.
- On failure: a single row with `[applied] = false` and the current values for the columns referenced in the `IF` clause.
Consistency for LWT is QUORUM and does not depend on the server's read
consistency setting. Normal reads continue to use the configured server-level
read consistency (ONE/QUORUM/ALL via --read-consistency).
Notes:
- The `IF` clause is parsed only when it appears as a trailing clause outside of quotes or comments (e.g., `-- comment`). Using the word "if" inside data values or identifiers does not trigger LWT behavior.
```bash
cargo test                                   # run unit tests
cargo run -- server                          # start the gRPC server on port 8080
cargo run -- server --read-consistency one   # only one healthy replica required for reads
```

Before submitting changes, ensure the code is formatted and tests pass:

```bash
cargo fmt
cargo test
```

The project uses idiomatic Rust patterns with small, focused functions. See the
module-level comments in src/ for a high-level overview of the architecture.
The server supports both local filesystem storage and Amazon S3.

Local storage is the default. Specify a directory with `--data-dir`:

```bash
cargo run -- --data-dir ./data
```

To use S3, set AWS credentials in the environment and provide the bucket name:

```bash
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... \
cargo run -- --storage s3 --bucket my-bucket
```

`AWS_REGION` controls the region (default `us-east-1`).
With the server running you can insert and query data using gRPC. The provided docker-compose.yml starts a five-node cluster using local storage with a replication factor of three, where you can try this out.

Start the cluster:

```bash
docker compose up
```

Connect using the built-in REPL and run some queries:

```
$ cass repl http://localhost:8080
> CREATE TABLE orders (customer_id TEXT, order_id TEXT, order_date TEXT, PRIMARY KEY(customer_id, order_id))
CREATE TABLE 1 table
> INSERT INTO orders VALUES ('nike', 'abc123', '2025-08-25')
INSERT 1 row
> INSERT INTO orders VALUES ('nike', 'def456', '2025-08-26')
INSERT 1 row
> SELECT * FROM orders WHERE customer_id = 'nike'
customer_id order_date order_id
0 nike 2025-08-25 abc123
1 nike 2025-08-26 def456
(2 rows)
> SELECT COUNT(1) FROM orders WHERE customer_id = 'nike'
count
0 2
(1 rows)
> SELECT * FROM orders WHERE customer_id = 'nike' AND order_id = 'abc123'
customer_id order_date order_id
0 nike 2025-08-25 abc123
(1 rows)
# Check-and-set (lightweight transaction) examples
> UPDATE orders SET order_date = '2025-08-27'
WHERE customer_id = 'nike' AND order_id = 'abc123' IF order_date = '2025-08-25'
[applied]
0 true
(1 rows)
> UPDATE orders SET order_date = '2025-08-28'
WHERE customer_id = 'nike' AND order_id = 'abc123' IF order_date = '2025-08-25'
[applied] order_date
0 false 2025-08-27
(1 rows)
```

Cass uses a coordinator-per-request model similar to Cassandra. Each statement is routed to the partition's replicas using a Murmur3-based ring. The coordinator enforces consistency and repairs divergence opportunistically:
- Read consistency: configured per server with `--read-consistency {one|quorum|all}`. If there are not enough healthy replicas for the chosen level, the read fails.
- Hinted handoff: if a write targets replicas that are currently unhealthy, the coordinator writes to the healthy replicas and stores a "hint" for each unreachable replica (original SQL and timestamp). When a replica becomes healthy again, the coordinator replays the hints to bring it up to date. Hints are in-memory and best-effort (non-durable across coordinator restarts).
- Read repair: for non-broadcast reads, the coordinator gathers results from healthy replicas, merges by last-write-wins (timestamp), and returns the freshest value (see the sketch below). If divergence is detected, it proactively repairs healthy stale replicas by sending the freshest value to them, and records hints for any replicas that are still down.
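As a rough illustration of that merge step (not the actual implementation; the types and names below are made up), a last-write-wins merge over replica responses might look like:

```rust
/// Hypothetical, simplified view of one replica's answer for a key: the value it
/// returned (if any) plus the write timestamp used for conflict resolution.
struct ReplicaResponse {
    replica: String,       // replica address
    value: Option<String>, // None if the replica had no row for the key
    timestamp: u64,        // write timestamp used for last-write-wins
}

/// Pick the freshest response by timestamp (last-write-wins) and collect the
/// replicas whose data is older and would therefore receive a read repair.
fn lww_merge(responses: &[ReplicaResponse]) -> Option<(&ReplicaResponse, Vec<&str>)> {
    let freshest = responses.iter().max_by_key(|r| r.timestamp)?;
    let stale = responses
        .iter()
        .filter(|r| r.timestamp < freshest.timestamp)
        .map(|r| r.replica.as_str())
        .collect();
    Some((freshest, stale))
}
```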
Tip: you can use cass panic <node> to temporarily mark a node as unhealthy and
observe hinted handoff and subsequent repair behavior when it recovers.
The CLI exposes helper commands useful during testing:

- `cass flush <node>` instructs the specified node to broadcast a flush to all peers.
- `cass panic <node>` forces the target node to report itself as unhealthy for 60 seconds.
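For example, assuming `<node>` takes the same `http://host:port` form used by `cass repl`:

```bash
cass panic http://localhost:8080   # mark the node unhealthy for 60 seconds
cass flush http://localhost:8080   # ask the node to broadcast a flush to all peers
```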
Each node exposes Prometheus metrics on the gRPC port plus 1000 at
/metrics (for example, if the server listens on 8080, metrics are
available on 9080). The provided docker-compose.yml also starts
Prometheus and Grafana. After running
`docker compose up`, visit http://localhost:3000 and sign in with the default
admin/admin credentials. The Grafana instance is preconfigured with the
Prometheus data source so you can explore metrics such as gRPC request
counts, peer health, RAM and CPU usage, and SSTable disk usage.
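To look at a node's raw metrics directly, you can hit its metrics port (gRPC port plus 1000); for example, for the node listening on 8080:

```bash
# Prometheus text-format metrics exposed on port 8080 + 1000
curl -s http://localhost:9080/metrics | head -n 20
```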
There is also a preconfigured dashboard with basic metrics from all instances. Screenshot below:
cass emits OpenTelemetry spans for every gRPC request, coordinator hop, and
lightweight-transaction phase. Spans are exported via OTLP/gRPC to the endpoint
specified in OTEL_EXPORTER_OTLP_ENDPOINT (default http://127.0.0.1:4317).
Each process also honours the standard OpenTelemetry metadata:
- `OTEL_SERVICE_NAME`: service name reported to the collector (`cass` by default).
- `OTEL_SERVICE_INSTANCE_ID`: a per-process identifier (defaults to the gRPC listen address when running `cass server`).
Clients (cass flush, cass panic, cass repl, and the CassClient
helpers) automatically propagate the current span context through gRPC metadata
so child spans on downstream nodes appear under the correct parent in your
tracing backend.
The bundled docker-compose.yml now includes a Jaeger all-in-one deployment.
Start the full stack with:
```bash
docker compose up
```

All five nodes forward spans to the Jaeger collector (http://jaeger:4317)
with unique service.instance.ids. Open the Jaeger UI at
http://localhost:16686, select the cass service, and you can inspect the
span graph for queries, replication fan-out, LWT prepare/propose cycles, and
hint replays.
To run the server outside of Docker, point it at any OTLP collector:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4317 \
OTEL_SERVICE_NAME=cass-dev \
OTEL_SERVICE_INSTANCE_ID=dev-node-1 \
cargo run -- server --node-addr http://127.0.0.1:8080
```

Then launch Jaeger separately if desired:

```bash
docker run --rm -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one:1.54
```

Once the server starts handling requests (for example via cass repl), spans
will appear in Jaeger with parent/child relationships that follow the full
replication and LWT flow across nodes.
The repository includes a simple harness for comparing write and read throughput of cass against a reference Apache Cassandra node.
```bash
scripts/perf_compare.sh   # runs both databases and stores metrics in ./perf-results
```

The script starts a three-node cass cluster and uses the example program perf_client to drive load. Metrics from the first cass node and nodetool statistics from Cassandra are written to the perf-results directory for analysis.
Current results (in comparison to Cassandra):
5 nodes, replication factor 3, read consistency QUORUM; the x-axis is the number of client threads issuing queries
Generate a CPU flamegraph for the query endpoint with a one-shot helper that runs the server under cargo flamegraph and drives load via the example perf client:
```bash
scripts/flamegraph_query.sh
```

Outputs an SVG under perf-results/, e.g. perf-results/query_flamegraph.svg.
Notes:
- Prereqs: `cargo install flamegraph`. On Linux, ensure `perf` is installed and accessible; on macOS, `dtrace` requires `sudo` and may require Developer Mode.
- Tunables via env vars (see the example below): `NODE` (default `http://127.0.0.1:8080`), `OPS` (default `10000`), `THREADS` (default `32`), `OUTDIR` (default `perf-results`), and `EXTRA_SERVER_ARGS` to pass through to `cass server`.
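For example, a hypothetical heavier run against a non-default node:

```bash
# all of these variables are optional; defaults are listed above
NODE=http://127.0.0.1:8081 OPS=50000 THREADS=64 OUTDIR=perf-results \
  scripts/flamegraph_query.sh
```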
Current flamegraph for simple reads and writes:
Single node
