Diff CSVs at any size, locally, with progress — no server required.
diffly is a CSV comparison toolkit designed around a single principle:
It should handle very large files without crashing, freezing the UI, or running out of memory — and it should always show progress.
The end-state is:
- a fast CLI for local use,
- a web app that runs entirely in the browser (no backend),
- optional desktop/mobile apps using the same engine.
Current status:
- Phase 1 (diffly-python): complete
- Phase 2 (diffly-rust engine): complete
- Phase 3 (diffly-cli): complete
- Phase 4 (diffly-web): MVP complete (worker + wasm + streaming fallback)
Manual validation guide: docs/MANUAL_TEST_PLAN.md
Prerequisites:
- Python 3.12+
- Rust stable toolchain (rustup + cargo)
- Node.js 22 (.nvmrc is included)
- Optional: wasm-pack when rebuilding the browser WASM bundle
From repo root:
make doctor
make bootstrap
make lint
make test
make check
make bootstrap installs the web app dependencies. make check mirrors the main local quality loop: linting, tests, typecheck, and production web build.
For a contributor-focused walkthrough, see CONTRIBUTING.md.
Most CSV diff tools fall into one of two traps:
- They load entire files into memory (fast for small files, but can crash on large ones).
- They work line-by-line (bounded memory), but don’t support meaningful keyed diffs or rich change reporting.
diffly aims for:
- bounded memory (works out-of-core),
- keyed comparisons (join-like semantics),
- streaming output (doesn’t require holding results in RAM),
- great UX (progress, ETA, cancellation),
- one core engine reused by CLI + web + (optionally) desktop/mobile.
To support “any file size (eventually)” without killing browser tabs or machines, the diff algorithm is designed to be:
- streaming: never load the full CSV into memory
- partitioned: split inputs into manageable chunks
- spilling to disk: store intermediate partitions outside RAM
When a comparison is keyed (user selects key columns):
- Stream rows from CSV A and CSV B.
- Compute a stable key from configured key columns.
- Hash the key to choose a partition: p = hash(key) % N
- Write row records into partition files: A_p and B_p
- Emit progress based on bytes read / total bytes.
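The partitioning pass above can be sketched in Python as follows. This is a minimal illustration, not the engine's actual code: the function names, the JSONL spill format, and the choice of BLAKE2 as the stable hash are all assumptions, and progress reporting is left out for brevity.

```python
import csv
import hashlib
import json
from pathlib import Path


def partition_of(key: tuple[str, ...], n: int) -> int:
    """Map a key to a partition using a hash that is stable across runs."""
    # Python's built-in hash() is salted per process for strings, so a
    # digest is used here instead (assumption: any stable hash would do).
    encoded = json.dumps(key, ensure_ascii=False).encode("utf-8")
    digest = hashlib.blake2b(encoded, digest_size=8).digest()
    return int.from_bytes(digest, "big") % n


def partition_csv(src, key_columns, n, out_dir: Path, side: str) -> list[int]:
    """Stream rows from `src` into <side>_<p>.jsonl files; return rows per partition.

    `src` is an open text stream (for real files, open with newline="").
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    files = [open(out_dir / f"{side}_{p}.jsonl", "w", encoding="utf-8") for p in range(n)]
    counts = [0] * n
    try:
        # Only one row is held in memory at a time.
        for row_index, row in enumerate(csv.DictReader(src), start=1):
            key = tuple(row[c] for c in key_columns)
            p = partition_of(key, n)
            record = {"key": key, "row_index": row_index, "row": row}
            files[p].write(json.dumps(record) + "\n")
            counts[p] += 1
    finally:
        for f in files:
            f.close()
    return counts
```

Calling `partition_csv` once per input with `side="A"` and `side="B"` yields the `A_p` / `B_p` files that the per-partition pass consumes.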
For each partition p:
- Load A_p into a hash map keyed by key (bounded because partitions are small).
- Stream B_p:
  - if key not in map → added
  - if key in map:
    - compare rows → unchanged or changed
    - remove key from map
- leftover keys in map → removed
- Emit progress by partitions completed + bytes processed.
This keeps memory bounded. Performance comes from streaming IO and per-partition in-memory joins.
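As a rough sketch of the per-partition join described above (the function name and record shape are illustrative assumptions, and duplicate keys are deliberately not handled here):

```python
from typing import Iterable, Iterator

# Assumed record shape: (key tuple, row as a column -> value mapping).
Record = tuple[tuple[str, ...], dict[str, str]]


def diff_partition(
    a_records: Iterable[Record],
    b_records: Iterable[Record],
    include_unchanged: bool = False,
) -> Iterator[dict]:
    """Join one partition: A is loaded into a map, B is streamed against it."""
    # Memory use is bounded by the size of this partition's A side.
    # Note: a duplicate key in A would silently overwrite here; a real
    # implementation needs an explicit duplicate-key policy.
    a_by_key = {key: row for key, row in a_records}

    for key, b_row in b_records:
        if key not in a_by_key:
            yield {"type": "added", "key": key, "row": b_row}
            continue
        # Remove the key so that whatever is left afterwards is "removed".
        a_row = a_by_key.pop(key)
        if a_row != b_row:
            yield {"type": "changed", "key": key, "before": a_row, "after": b_row}
        elif include_unchanged:
            yield {"type": "unchanged", "key": key}

    for key, a_row in a_by_key.items():
        yield {"type": "removed", "key": key, "row": a_row}
```

Because the function is a generator, events can be written to the output as they are produced instead of being collected in RAM.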
diffly is built to be a good citizen on laptops and phones:
- ✅ Never block the UI (runs in workers / background threads)
- ✅ Show progress continuously (bytes processed, phase, partitions)
- ✅ Provide an ETA when possible (moving-average throughput)
- ✅ Support cancel (abort streams + cleanup)
- ✅ Fail gracefully if storage is insufficient (no crashes)
“Any size” in-browser is limited by:
- device storage
- browser quota
- implementation differences (iOS Safari is often most restrictive)
diffly should:
- estimate required spill space,
- detect low space as best it can,
- stop early with a clear message when the device cannot support the job.
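For the native case, a pre-flight check along these lines is one possible approach. The 1.5x overhead factor and both function names are assumptions for illustration; in a browser the free-space figure would instead have to come from a quota API such as `navigator.storage.estimate()`.

```python
import os
import shutil


def required_spill_bytes(input_sizes, overhead=1.5):
    """Rough spill estimate: total input size times an assumed overhead factor."""
    return int(sum(input_sizes) * overhead)


def check_spill_space(input_paths, spill_dir, overhead=1.5):
    """Raise early, with a clear message, if the spill directory looks too small."""
    required = required_spill_bytes(
        [os.path.getsize(p) for p in input_paths], overhead
    )
    free = shutil.disk_usage(spill_dir).free
    if free < required:
        raise RuntimeError(
            f"Not enough free space in {spill_dir}: "
            f"need about {required} bytes, only {free} available."
        )
    return required
```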
The project is intended to be built in stages:
Phase 1 (diffly-python):
Goal: lock down diff semantics quickly.
- A reference implementation written in Python
- A canonical diff output model
- A suite of fixtures + golden outputs that define expected behavior:
  - quoting, escapes, newlines
  - header mismatch
  - missing columns
  - duplicate keys
  - type coercion rules (if any)
  - stable ordering rules for deterministic tests
Important: Phase 1 is for semantics, not performance.
The design must still be compatible with streaming + out-of-core execution later.
Phase 2 (diffly-rust engine):
Goal: build the production-grade engine with bounded memory.
- Rust crate(s) implementing the partitioned diff
- Storage backends abstracted behind traits:
  - native temp directory
  - browser OPFS / IndexedDB
- Progress reporting + cancel support built in
Phase 3 (diffly-cli):
Goal: a great local CLI built on the Rust engine.
- single binary distribution
- outputs:
  - JSON / JSONL for streaming consumption
  - optional human-readable summary tables
  - optional HTML report (later)
We will reuse the Rust engine in different environments.
Phase 4 (diffly-web):
Goal: browser-only app for local comparisons.
- compile Rust core to WASM
- run in a Web Worker (never block UI thread)
- storage backend uses OPFS (preferred) or IndexedDB (fallback)
- React/TS/Next for UI (progress, results rendering)
Goal: desktop/mobile apps using Flutter + Rust engine.
- Dart FFI wrapper around Rust native engine
- Flutter UI for desktop/mobile
- web may still use the React/WASM app (same engine, different UI)
All implementations should be able to emit a consistent, streamable diff.
A suggested approach is JSON Lines (JSONL). The output stream may include both data events (added/removed/changed rows) and meta events (schema/progress/warnings/stats) so UIs can show summaries and progress without materializing the entire diff.
Data events:
- added
- removed
- changed
- unchanged (optional / usually omitted for size)
Meta events:
- schema
- progress
- warning
- stats (summary frames; may be emitted periodically and/or at end)
Events should distinguish:
- row identity (key columns, key values)
- row location (optional original row numbers in A and B)
- row delta (field-level differences suitable for “inspect” view)
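To make the identity / location / delta split concrete, here is a small sketch that assembles a changed event from two rows. The field names follow the changed event example in this document; the function itself and its parameters are hypothetical.

```python
def changed_event(key_columns, a_row, b_row, a_row_number=None, b_row_number=None):
    """Build a changed event for two rows, or return None if nothing differs.

    Values are compared as plain strings; only columns present in both rows
    are considered.
    """
    changed = [c for c in a_row if c in b_row and a_row[c] != b_row[c]]
    if not changed:
        return None
    return {
        "type": "changed",
        # Row identity: key columns and their values.
        "key": {c: b_row[c] for c in key_columns},
        # Row location: optional original row numbers in A and B.
        "loc": {"a_row": a_row_number, "b_row": b_row_number},
        # Row delta: which fields changed, and how.
        "changed": changed,
        "before": a_row,
        "after": b_row,
        "delta": {c: {"from": a_row[c], "to": b_row[c]} for c in changed},
    }
```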
Schema event (early):
{
"type": "schema",
"columns_a": ["id", "name", "status"],
"columns_b": ["id", "name", "status"],
"header_row_a": 1,
"header_row_b": 1
}
Progress event (periodic):
{
"type": "progress",
"phase": "partitioning",
"bytes_read": 52428800,
"bytes_total": 209715200,
"partitions_total": 256,
"partitions_done": 0,
"throughput_bytes_per_sec": 18432000,
"eta_seconds": 8.4
}
Changed event (row-level delta, optimized for inspection):
{
"type": "changed",
"key": { "id": "123" },
"loc": { "a_row": 1823, "b_row": 1840 },
"changed": ["name"],
"before": { "id": "123", "name": "Alice", "status": "active" },
"after": { "id": "123", "name": "Alicia", "status": "active" },
"delta": {
"name": { "from": "Alice", "to": "Alicia" }
}
}
Added event:
{
"type": "added",
"key": { "id": "999" },
"loc": { "a_row": null, "b_row": 20491 },
"row": { "id": "999", "name": "Zoe", "status": "active" }
}
Removed event:
{
"type": "removed",
"key": { "id": "888" },
"loc": { "a_row": 19910, "b_row": null },
"row": { "id": "888", "name": "Sam", "status": "inactive" }
}
Duplicate key warning (important for keyed semantics):
{
"type": "warning",
"code": "duplicate_key",
"key": { "id": "123" },
"count_a": 2,
"count_b": 1,
"a_rows": [10, 99],
"b_rows": [12],
"message": "Duplicate key encountered; diff semantics may be ambiguous for this key."
}
Stats frame (periodic or final, ideal for the “summary → map” UX):
{
"type": "stats",
"rows_total_compared": 250000,
"rows_added": 1203,
"rows_removed": 17,
"rows_changed": 84,
"cells_changed": 9412,
"changed_cells_by_column": {
"price": 5120,
"status": 820,
"updated_at": 3200,
"name": 272
},
"truncated": false
}
Because the engine may partition and spill to disk, output ordering must be explicitly defined for deterministic tests.
Suggested rule (v1):
- Emit results in partition order (p = 0..N-1).
- Within a partition, emit in a stable order (e.g. by key hash then key bytes).
- Include partition_id in progress frames (optional) to aid debugging.
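The suggested rule amounts to sorting by the tuple (partition, key hash, key bytes). The sketch below shows that ordering on in-memory events purely for illustration; the hash function and the way key values are joined into bytes are assumptions, and a real engine would sort within each partition rather than across the whole output.

```python
import hashlib


def key_hash(key_bytes: bytes) -> int:
    """Stable 64-bit hash of the encoded key (assumed: BLAKE2b)."""
    return int.from_bytes(hashlib.blake2b(key_bytes, digest_size=8).digest(), "big")


def sort_key(event: dict, n: int):
    """Ordering tuple: partition first, then key hash, then raw key bytes."""
    # Assumption: key values are joined with the ASCII unit separator.
    key_bytes = "\x1f".join(event["key"].values()).encode("utf-8")
    h = key_hash(key_bytes)
    # key_bytes breaks ties if two different keys ever share a hash.
    return (h % n, h, key_bytes)


def ordered(events, n):
    """Return events in deterministic output order for n partitions."""
    return sorted(events, key=lambda e: sort_key(e, n))
```

With this rule the output order depends only on the keys and the partition count, not on the order in which rows were read.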
diffly currently supports two compare strategies plus an order option:
- Keyed:
  - external hash-join semantics using key columns
  - supports adds/removes/changes per key
  - must define behavior for duplicate keys (warn/error/group)
- Positional:
  - compare row i in A to row i in B
  - adds/removes reflect differing lengths
- Ignore row order:
  - treat each row as a value; compare as a multiset
  - emits added/removed deltas and unchanged counts
  - rejects keyed + ignore-row-order combination
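The multiset comparison can be illustrated with `collections.Counter`. This is a simplified in-memory sketch under the assumption that each row is a list of strings; it does not reflect how the engine bounds memory for large inputs.

```python
from collections import Counter


def multiset_diff(rows_a, rows_b):
    """Compare two row collections as multisets, ignoring row order."""
    count_a = Counter(tuple(r) for r in rows_a)
    count_b = Counter(tuple(r) for r in rows_b)
    # Counter subtraction keeps only positive counts, so duplicates are
    # matched one-for-one between the two sides.
    removed = count_a - count_b
    added = count_b - count_a
    unchanged = sum((count_a & count_b).values())
    return {
        "added": sorted(added.elements()),
        "removed": sorted(removed.elements()),
        "unchanged": unchanged,
    }
```

A row that appears twice in A and once in B is reported as one unchanged row and one removed row, which is why this mode emits added/removed deltas and never changed events.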
diffly/
  diffly-spec/       # fixtures + golden outputs + semantics docs
  diffly-python/     # reference implementation + runs spec tests
  diffly-rust/
    diffly-core/     # diff semantics + model types
    diffly-engine/   # partitioning + out-of-core algorithm
    diffly-native/   # native storage backend (tempdir)
    diffly-wasm/     # wasm bindings + OPFS/IDB backend
    diffly-cli/      # CLI wrapper using diffly-native
  diffly-web/        # React/Next UI that runs diffly-wasm in a worker
  diffly-dart/       # optional Dart FFI wrapper for diffly-native
- Library-first: core engine is reusable, CLI/UI are thin wrappers.
- Deterministic: stable ordering rules for outputs and testability.
- Streaming everywhere: parsing, partitioning, diffing, output.
- Storage-pluggable: same algorithm, different backends.
- Progress as a first-class API: every long operation reports state.
- Cancellation: must be supported for all long-running operations.
This repo is early-stage and actively evolving.
You can run a positional diff locally using the Python reference implementation:
make diff A=path/to/a.csv B=path/to/b.csv
Enable keyed mode by providing key columns:
make diff A=path/to/a.csv B=path/to/b.csv KEY=id
Composite keys are also supported via make:
make diff A=a.csv B=b.csv KEYS=id,region
For sorted-header comparison mode:
make diff A=a.csv B=b.csv KEY=id HEADER_MODE=sorted
Alias flag:
make diff A=a.csv B=b.csv KEY=id IGNORE_COLUMN_ORDER=1
Ignore row order (multiset positional):
make diff A=a.csv B=b.csv IGNORE_ROW_ORDER=1
Composite keys also work when calling the script directly:
python3 diffly-python/diffly.py --a a.csv --b b.csv --key id --key region
The command emits JSONL events (schema, row events, stats) to stdout.
Current semantics are strict string comparison with hard errors for duplicate column names, missing key columns, and missing key values.
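Since the output is one JSON object per line, a consumer can process it without loading the whole diff. The helper below is a hypothetical example (not part of the repo) that tallies events by their type field.

```python
import json
from collections import Counter


def summarize(lines):
    """Count JSONL events by "type" while reading one line at a time."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        counts[event["type"]] += 1
    return dict(counts)
```

Passing `sys.stdin` as `lines` would let this run at the end of a shell pipe fed by the diff command.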
Rust workspace now lives in diffly-rust/ with:
- diffly-core (CSV diff semantics)
- diffly-engine (engine/runtime boundary with sink, cancel, progress, and partitioned spill utilities)
- diffly-cli (native CLI surface for positional + keyed diff)
- diffly-conformance (runs diffly-spec fixtures)
Run Rust parity checks with:
make test-spec-rust
make test-spec-rust-engine PARTITIONS=4
Run the native Rust CLI via:
make diff-rust A=a.csv B=b.csv
Enable keyed mode by providing key columns:
make diff-rust A=a.csv B=b.csv KEY=id
Progress events can be emitted with:
make diff-rust A=a.csv B=b.csv KEY=id EMIT_PROGRESS=1
In partitioned mode, progress phases currently emit as: partitioning -> diff_partitions -> emit_events.
The Rust CLI now supports multiple output modes:
# default JSONL stream (event-per-line)
make diff-rust A=a.csv B=b.csv
# single JSON array output
make diff-rust A=a.csv B=b.csv FORMAT=json
# human-readable summary table
make diff-rust A=a.csv B=b.csv FORMAT=summary
# git-style terminal diff inspector
make diff-rust A=a.csv B=b.csv KEY=id FORMAT=diff
# write any mode to file
make diff-rust A=a.csv B=b.csv FORMAT=json OUT=/tmp/diff.json
FORMAT=diff is a human-only terminal view. It keeps jsonl/json unchanged and renders:
- changed rows first
- changed columns with inline A | B comparisons
- field-level emphasis using existing changed/before/after/delta event data
- inline substring markers for changed cell values when color is disabled
Set NO_COLOR=1 if you want plain-text diff output without ANSI styling.
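As an illustration of what inline substring markers could look like in plain-text mode, the sketch below brackets the differing middle of two cell values after trimming their common prefix and suffix. The bracket style and the function are hypothetical; the CLI's actual marker format is not specified here.

```python
def mark_change(before: str, after: str) -> tuple[str, str]:
    """Wrap the differing middle of two changed values in [brackets].

    Intended for values that actually differ; identical inputs produce an
    empty marker at the end of each string.
    """
    limit = min(len(before), len(after))

    # Length of the shared prefix.
    p = 0
    while p < limit and before[p] == after[p]:
        p += 1

    # Length of the shared suffix, not overlapping the prefix.
    s = 0
    while s < limit - p and before[-1 - s] == after[-1 - s]:
        s += 1

    def wrap(text: str) -> str:
        end = len(text) - s
        return f"{text[:p]}[{text[p:end]}]{text[end:]}"

    return wrap(before), wrap(after)
```

For example, `mark_change("Alice", "Alicia")` returns `("Alic[e]", "Alic[ia]")`.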
For keyed mode in any format, include KEY=... or KEYS=....
Ignore column order alias:
make diff-rust A=a.csv B=b.csv KEY=id IGNORE_COLUMN_ORDER=1
Ignore row order (multiset positional):
make diff-rust A=a.csv B=b.csv IGNORE_ROW_ORDER=1
diffly-web/ is now seeded from the DiffyData-style UX and wired to diffly runtime semantics:
- runs comparison in a dedicated Web Worker (main thread stays responsive)
- uses Rust/WASM path for small files
- uses partitioned IndexedDB spill path for larger files (and falls back to in-memory worker mode if IndexedDB is unavailable)
- supports cancel + phase progress frames in the UI
- supports strategy selection: positional, ignore row order, compare by key
- supports ignore-column-order toggle and key input only when keyed strategy is selected
Install and run:
make web-install
make web-lint
make web-dev
Type-check/build:
make web-typecheck
npm --prefix diffly-web run build
Build/update Rust WASM package for web:
make wasm-build-web
Rust CLI now uses the partitioned engine path by default (64 partitions). Override partition count with:
make diff-rust A=a.csv B=b.csv KEY=id PARTITIONS=64
Force the legacy non-partitioned core path (for debugging/comparison):
make diff-rust A=a.csv B=b.csv KEY=id NO_PARTITIONS=1
GitHub Actions now runs on pull requests and pushes to main:
- make test-spec
- make test-python
- python -m compileall diffly-python
- a fixture-backed CLI smoke test via python diffly-python/diffly.py ...
- Python smoke coverage for positional default, ignore-row-order, and ignore-column-order
- Rust fmt check + cargo test + Rust fixture conformance (make test-spec-rust)
- Rust engine conformance parity mode (make test-spec-rust-engine PARTITIONS=4)
- Rust CLI smoke test via make diff-rust ...
- Rust smoke coverage for positional default, ignore-row-order, and ignore-column-order
- Rust partitioned CLI smoke test via make diff-rust ... PARTITIONS=4
- Web app lint/typecheck/build (make web-lint + make web-typecheck + make web-build)
To preserve execution context across sessions/agents:
- docs/STATUS.md tracks current progress, blockers, and next steps.
- docs/DECISIONS.md tracks active semantic/product decisions.
- docs/HANDOFF.md provides a quick resume checklist.
Next steps:
- Add OPFS/IndexedDB spill backend integration for browser large-file path.
- Add browser-level large-file regression automation (100MB+ behavior checks).
- Continue expanding fixture coverage for CSV edge cases.
