diffly

Diff CSVs at any size, locally, with progress — no server required.

diffly is a CSV comparison toolkit designed around a single principle:

It should handle very large files without crashing, freezing the UI, or running out of memory — and it should always show progress.

The end-state is:

a fast CLI for local use,
a web app that runs entirely in the browser (no backend),
optional desktop/mobile apps using the same engine.

Phase status

Phase 1 (diffly-python): complete
Phase 2 (diffly-rust engine): complete
Phase 3 (diffly-cli): complete
Phase 4 (diffly-web): MVP complete (worker + wasm + streaming fallback)

Manual validation guide: docs/MANUAL_TEST_PLAN.md

Quick start

Prerequisites:

Python 3.12+
Rust stable toolchain (rustup + cargo)
Node.js 22 (.nvmrc is included)
Optional: wasm-pack when rebuilding the browser WASM bundle

From repo root:

make doctor
make bootstrap
make lint
make test
make check

make bootstrap installs the web app dependencies. make check mirrors the main local quality loop: linting, tests, typecheck, and production web build.

For a contributor-focused walkthrough, see CONTRIBUTING.md.

Why diffly exists

Most CSV diff tools fall into one of two traps:

They load entire files into memory (fast for small files, but can crash on large ones).
They work line-by-line (bounded memory), but don’t support meaningful keyed diffs or rich change reporting.

diffly aims for:

bounded memory (works out-of-core),
keyed comparisons (join-like semantics),
streaming output (doesn’t require holding results in RAM),
great UX (progress, ETA, cancellation),
one core engine reused by CLI + web + (optionally) desktop/mobile.

Core idea: streaming + out-of-core diff engine

To support “any file size (eventually)” without killing browser tabs or machines, the diff algorithm is designed to be:

streaming: never load the full CSV into memory
partitioned: split inputs into manageable chunks
spilling to disk: store intermediate partitions outside RAM

External hash-join (partitioned diff)

When a comparison is keyed (user selects key columns):

Pass 1 — Partitioning

Stream rows from CSV A and CSV B.
Compute a stable key from configured key columns.
Hash the key to choose a partition: p = hash(key) % N
Write row records into partition files: A_p and B_p
Emit progress based on bytes read / total bytes.
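
The key and partition steps of Pass 1 can be sketched in Python as below. This is a minimal illustration, not the engine's code: the function names `stable_key` and `partition_of`, the length-prefixed key encoding, and the choice of BLAKE2b are all assumptions made for the example. The property that matters is that the hash is stable across runs and processes; Python's built-in `hash()` is salted per process for strings, so it would not qualify.

```python
import hashlib

def stable_key(row: dict[str, str], key_columns: list[str]) -> bytes:
    """Build a stable byte key from the configured key columns."""
    # Length-prefix each value so ("ab", "c") and ("a", "bc") cannot collide.
    parts = []
    for col in key_columns:
        value = row[col].encode("utf-8")
        parts.append(len(value).to_bytes(4, "big") + value)
    return b"".join(parts)

def partition_of(key: bytes, n_partitions: int) -> int:
    """p = hash(key) % N, using a hash that is stable across processes."""
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_partitions

row_a = {"id": "123", "region": "eu", "name": "Alice"}
row_b = {"id": "123", "region": "eu", "name": "Alicia"}
keys = ["id", "region"]
# Rows with the same key always land in the same partition,
# so A_p only ever needs to be compared against B_p.
assert partition_of(stable_key(row_a, keys), 64) == partition_of(stable_key(row_b, keys), 64)
```

Because both inputs use the same key encoding and hash, a key present in both files is guaranteed to meet itself in Pass 2.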

Pass 2 — Diff partitions one-by-one

For each partition p:

Load A_p into a hash map indexed by row key (bounded because partitions are small).
Stream B_p:
- if the key is not in the map → added
- if the key is in the map:
  - compare rows → unchanged or changed
  - remove the key from the map
Keys left in the map once B_p is exhausted → removed
Emit progress by partitions completed + bytes processed.

This keeps memory bounded. Performance comes from streaming IO and per-partition in-memory joins.
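
As an illustration of Pass 2, here is a minimal Python sketch of diffing one partition. The name `diff_partition` and the `(key, row)` tuple shape are hypothetical, and the sketch assumes keys are unique within each side; duplicate-key handling is deliberately left out.

```python
from typing import Iterable, Iterator

Row = dict[str, str]

def diff_partition(
    a_rows: Iterable[tuple[str, Row]],
    b_rows: Iterable[tuple[str, Row]],
) -> Iterator[tuple[str, str]]:
    """Yield (event_type, key) for one partition; rows arrive as (key, row)."""
    # Load A_p into a hash map; bounded because a partition is small.
    a_map = dict(a_rows)
    # Stream B_p against the map.
    for key, b_row in b_rows:
        if key not in a_map:
            yield ("added", key)
            continue
        a_row = a_map.pop(key)  # remove so leftovers are exactly the removals
        yield ("unchanged" if a_row == b_row else "changed", key)
    # Keys never seen in B_p were removed.
    for key in a_map:
        yield ("removed", key)

a_p = [("1", {"name": "Alice"}), ("2", {"name": "Bob"}), ("3", {"name": "Cy"})]
b_p = [("1", {"name": "Alicia"}), ("3", {"name": "Cy"}), ("4", {"name": "Dee"})]
events = list(diff_partition(a_p, b_p))
print(events)
# [('changed', '1'), ('unchanged', '3'), ('added', '4'), ('removed', '2')]
```

Only A_p is held in memory; B_p is consumed as a stream, which is why peak memory tracks the largest A-side partition rather than the whole input.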

UX requirements (non-negotiable)

diffly is built to be a good citizen on laptops and phones:

✅ Never block the UI (runs in workers / background threads)
✅ Show progress continuously (bytes processed, phase, partitions)
✅ Provide an ETA when possible (moving-average throughput)
✅ Support cancel (abort streams + cleanup)
✅ Fail gracefully if storage is insufficient (no crashes)
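
One way to derive an ETA from moving-average throughput is sketched below. The class name `EtaEstimator`, the window size, and the sampling interface are assumptions for illustration; the real progress API may differ.

```python
from collections import deque
from typing import Optional

class EtaEstimator:
    """Estimate remaining time from a moving average of recent throughput."""

    def __init__(self, bytes_total: int, window: int = 5) -> None:
        self.bytes_total = bytes_total
        # Keep only the most recent (timestamp, bytes_read) samples.
        self.samples = deque(maxlen=window)

    def update(self, now: float, bytes_read: int) -> Optional[float]:
        """Record a sample; return ETA in seconds, or None if unknown."""
        self.samples.append((now, bytes_read))
        if len(self.samples) < 2:
            return None
        (t0, b0), (t1, b1) = self.samples[0], self.samples[-1]
        if t1 <= t0 or b1 <= b0:
            return None  # no measurable progress inside the window
        throughput = (b1 - b0) / (t1 - t0)  # bytes per second
        return (self.bytes_total - b1) / throughput

eta = EtaEstimator(bytes_total=1000)
first = eta.update(0.0, 0)       # None: a single sample gives no rate
second = eta.update(1.0, 100)    # 100 B/s with 900 B left
print(first, second)
# None 9.0
```

Returning `None` instead of a guess matches the "when possible" wording above: a UI can simply hide the ETA until enough samples exist.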

Browser storage reality

“Any size” in-browser is limited by:

device storage
browser quota
implementation differences (iOS Safari is often most restrictive)

diffly should:

estimate required spill space,
detect low space as best it can,
stop early with a clear message when the device cannot support the job.

Repo roadmap (phased plan)

The project is intended to be built in stages:

Phase 1 — `diffly-python` (reference semantics + golden tests)

Goal: lock down diff semantics quickly.

A reference implementation written in Python
A canonical diff output model
A suite of fixtures + golden outputs that define expected behavior:
- quoting, escapes, newlines
- header mismatch
- missing columns
- duplicate keys
- type coercion rules (if any)
- stable ordering rules for deterministic tests

Important: Phase 1 is for semantics, not performance.
The design must still be compatible with streaming + out-of-core execution later.

Phase 2 — `diffly-rust` (real engine)

Goal: build the production-grade engine with bounded memory.

Rust crate(s) implementing the partitioned diff
Storage backends abstracted behind traits:
- native temp directory
- browser OPFS / IndexedDB
Progress reporting + cancel support built in

Phase 3 — `diffly-cli`

Goal: a great local CLI built on the Rust engine.

single binary distribution
outputs:
- JSON / JSONL for streaming consumption
- optional human-readable summary tables
- optional HTML report (later)

Phase 4 — UI options

We will reuse the Rust engine in different environments.

Option (ii) — Rust → WASM + Web app (best fit for “no server”)

Goal: browser-only app for local comparisons.

compile Rust core to WASM
run in a Web Worker (never block UI thread)
storage backend uses OPFS (preferred) or IndexedDB (fallback)
React/TS/Next for UI (progress, results rendering)

Option (i) — Flutter desktop/mobile via Dart FFI

Goal: desktop/mobile apps using Flutter + Rust engine.

Dart FFI wrapper around Rust native engine
Flutter UI for desktop/mobile
web may still use the React/WASM app (same engine, different UI)

Canonical diff model (draft)

All implementations should be able to emit a consistent, streamable diff.

A suggested approach is JSON Lines (JSONL). The output stream may include both data events (added/removed/changed rows) and meta events (schema/progress/warnings/stats) so UIs can show summaries and progress without materializing the entire diff.

Output stream event types (proposed)

Data events:

added
removed
changed
unchanged (optional / usually omitted for size)

Meta events:

schema
progress
warning
stats (summary frames; may be emitted periodically and/or at end)

Row identity vs presentation

Events should distinguish:

row identity (key columns, key values)
row location (optional original row numbers in A and B)
row delta (field-level differences suitable for “inspect” view)

Example events

Schema event (early):

{
  "type": "schema",
  "columns_a": ["id", "name", "status"],
  "columns_b": ["id", "name", "status"],
  "header_row_a": 1,
  "header_row_b": 1
}

Progress event (periodic):

{
  "type": "progress",
  "phase": "partitioning",
  "bytes_read": 52428800,
  "bytes_total": 209715200,
  "partitions_total": 256,
  "partitions_done": 0,
  "throughput_bytes_per_sec": 18432000,
  "eta_seconds": 8.4
}

Changed event (row-level delta, optimized for inspection):

{
  "type": "changed",
  "key": { "id": "123" },
  "loc": { "a_row": 1823, "b_row": 1840 },
  "changed": ["name"],
  "before": { "id": "123", "name": "Alice", "status": "active" },
  "after":  { "id": "123", "name": "Alicia", "status": "active" },
  "delta": {
    "name": { "from": "Alice", "to": "Alicia" }
  }
}

Added event:

{
  "type": "added",
  "key": { "id": "999" },
  "loc": { "a_row": null, "b_row": 20491 },
  "row": { "id": "999", "name": "Zoe", "status": "active" }
}

Removed event:

{
  "type": "removed",
  "key": { "id": "888" },
  "loc": { "a_row": 19910, "b_row": null },
  "row": { "id": "888", "name": "Sam", "status": "inactive" }
}

Duplicate key warning (important for keyed semantics):

{
  "type": "warning",
  "code": "duplicate_key",
  "key": { "id": "123" },
  "count_a": 2,
  "count_b": 1,
  "a_rows": [10, 99],
  "b_rows": [12],
  "message": "Duplicate key encountered; diff semantics may be ambiguous for this key."
}

Stats frame (periodic or final, ideal for the “summary → map” UX):

{
  "type": "stats",
  "rows_total_compared": 250000,
  "rows_added": 1203,
  "rows_removed": 17,
  "rows_changed": 84,
  "cells_changed": 9412,
  "changed_cells_by_column": {
    "price": 5120,
    "status": 820,
    "updated_at": 3200,
    "name": 272
  },
  "truncated": false
}
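
To show how a consumer might use such a stream without materializing the whole diff, here is a small Python sketch. The function name `summarize` is hypothetical, and the field names follow the draft examples above, which may still change.

```python
import json
from collections import Counter
from typing import Iterable

DATA_EVENTS = {"added", "removed", "changed", "unchanged"}

def summarize(lines: Iterable[str]) -> dict:
    """Tally data events and keep the latest stats frame, one line at a time."""
    counts = Counter()
    last_stats = None
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between events
        event = json.loads(line)
        kind = event["type"]
        if kind in DATA_EVENTS:
            counts[kind] += 1
        elif kind == "stats":
            last_stats = event  # later frames supersede earlier ones
    return {"counts": dict(counts), "stats": last_stats}

stream = [
    '{"type": "schema", "columns_a": ["id"], "columns_b": ["id"]}',
    '{"type": "added", "key": {"id": "999"}}',
    '{"type": "changed", "key": {"id": "123"}}',
    '{"type": "stats", "rows_added": 1, "rows_changed": 1}',
]
summary = summarize(stream)
print(summary["counts"])
# {'added': 1, 'changed': 1}
```

Because `summarize` accepts any iterable of lines, the same function could read from `sys.stdin` or an open file and still process one event at a time.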

Determinism and ordering rules

Because the engine may partition and spill to disk, output ordering must be explicitly defined for deterministic tests.

Suggested rule (v1):

Emit results in partition order (p = 0..N-1).
Within a partition, emit in a stable order (e.g. by key hash then key bytes).
Include partition_id in progress frames (optional) to aid debugging.
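
The suggested ordering rule could look like this in Python. The helper names `order_key` and `emit_order` are hypothetical, and BLAKE2b stands in for whatever stable hash the engine actually uses.

```python
import hashlib

def order_key(key: bytes) -> tuple[bytes, bytes]:
    """Sort by key hash first, then by raw key bytes to break hash ties."""
    return (hashlib.blake2b(key, digest_size=8).digest(), key)

def emit_order(partitions: dict[int, list[bytes]]) -> list[tuple[int, bytes]]:
    """Return (partition_id, key) pairs in the deterministic output order."""
    ordered = []
    for p in sorted(partitions):  # partition order p = 0..N-1
        for key in sorted(partitions[p], key=order_key):
            ordered.append((p, key))
    return ordered

# The same keys arriving in a different order produce the same output order.
run_1 = emit_order({1: [b"b", b"a"], 0: [b"z", b"c"]})
run_2 = emit_order({0: [b"c", b"z"], 1: [b"a", b"b"]})
assert run_1 == run_2
```

Breaking ties on the raw key bytes keeps the order total even if two distinct keys share a hash.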

Diff modes (semantics)

diffly currently supports two compare strategies plus an order option:

`keyed` mode (enabled when key columns are provided)

external hash-join semantics using key columns
supports adds/removes/changes per key
must define behavior for duplicate keys (warn/error/group)

`positional` mode (default when no key columns are provided)

compare row i in A to row i in B
adds/removes reflect differing lengths
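
A minimal Python sketch of positional comparison, assuming rows are already parsed into lists of strings. The name `positional_diff` and the `(event_type, row_index)` output shape are illustrative only.

```python
from itertools import zip_longest
from typing import Iterable, Iterator

def positional_diff(
    a: Iterable[list[str]], b: Iterable[list[str]]
) -> Iterator[tuple[str, int]]:
    """Compare row i of A to row i of B; yield (event_type, 1-based row index)."""
    for i, (row_a, row_b) in enumerate(zip_longest(a, b), start=1):
        if row_a is None:
            yield ("added", i)      # B is longer than A
        elif row_b is None:
            yield ("removed", i)    # A is longer than B
        elif row_a != row_b:
            yield ("changed", i)
        else:
            yield ("unchanged", i)

a = [["1", "Alice"], ["2", "Bob"], ["3", "Cy"]]
b = [["1", "Alice"], ["2", "Bobby"]]
positional_events = list(positional_diff(a, b))
print(positional_events)
# [('unchanged', 1), ('changed', 2), ('removed', 3)]
```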

`--ignore-row-order` (positional multiset option)

treat each row as a value; compare as a multiset
emits added/removed deltas and unchanged counts
rejects keyed + ignore-row-order combination
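
The multiset semantics can be illustrated with `collections.Counter`. This sketch holds all counts in memory, so it demonstrates the semantics only, not the out-of-core execution; the name `multiset_diff` and the returned dictionary shape are assumptions.

```python
from collections import Counter
from typing import Iterable

def multiset_diff(a: Iterable[list[str]], b: Iterable[list[str]]) -> dict:
    """Compare rows as a multiset: order is ignored, multiplicity is not."""
    count_a = Counter(tuple(row) for row in a)
    count_b = Counter(tuple(row) for row in b)
    return {
        "added": count_b - count_a,      # rows with more copies in B
        "removed": count_a - count_b,    # rows with more copies in A
        "unchanged": sum((count_a & count_b).values()),
    }

a = [["1", "x"], ["2", "y"], ["2", "y"]]
b = [["2", "y"], ["1", "x"], ["3", "z"]]
result = multiset_diff(a, b)
print(dict(result["added"]), dict(result["removed"]), result["unchanged"])
# {('3', 'z'): 1} {('2', 'y'): 1} 2
```

Note that the second copy of `["2", "y"]` in A counts as removed even though the same row value also appears in B; only matched copies are unchanged.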

Monorepo layout (proposed)

diffly/
  diffly-spec/        # fixtures + golden outputs + semantics docs
  diffly-python/      # reference implementation + runs spec tests
  diffly-rust/
    diffly-core/      # diff semantics + model types
    diffly-engine/    # partitioning + out-of-core algorithm
    diffly-native/    # native storage backend (tempdir)
    diffly-wasm/      # wasm bindings + OPFS/IDB backend
  diffly-cli/         # CLI wrapper using diffly-native
  diffly-web/         # React/Next UI that runs diffly-wasm in a worker
  diffly-dart/        # optional Dart FFI wrapper for diffly-native

Design constraints / principles

Library-first: core engine is reusable, CLI/UI are thin wrappers.
Deterministic: stable ordering rules for outputs and testability.
Streaming everywhere: parsing, partitioning, diffing, output.
Storage-pluggable: same algorithm, different backends.
Progress as a first-class API: every long operation reports state.
Cancellation: must be supported for all long-running operations.

Development status

This repo is early-stage and actively evolving.

Current CLI (Phase 1 reference)

You can run a positional diff locally using the Python reference implementation:

make diff A=path/to/a.csv B=path/to/b.csv

Enable keyed mode by providing key columns:

make diff A=path/to/a.csv B=path/to/b.csv KEY=id

Composite keys are also supported via make:

make diff A=a.csv B=b.csv KEYS=id,region

For sorted-header comparison mode:

make diff A=a.csv B=b.csv KEY=id HEADER_MODE=sorted

Equivalent alias for sorted-header mode:

make diff A=a.csv B=b.csv KEY=id IGNORE_COLUMN_ORDER=1

Ignore row order (multiset positional):

make diff A=a.csv B=b.csv IGNORE_ROW_ORDER=1

Composite keys can also be passed by calling the script directly with a repeated --key flag:

python3 diffly-python/diffly.py --a a.csv --b b.csv --key id --key region

The command emits JSONL events (schema, row events, stats) to stdout. Current semantics are strict string comparison with hard errors for duplicate column names, missing key columns, and missing key values.

Rust engine (Phase 2 complete)

The Rust workspace lives in diffly-rust/ and contains:

diffly-core (CSV diff semantics)
diffly-engine (engine/runtime boundary with sink, cancel, progress, and partitioned spill utilities)
diffly-cli (native CLI surface for positional + keyed diff)
diffly-conformance (runs diffly-spec fixtures)

Run Rust parity checks with:

make test-spec-rust
make test-spec-rust-engine PARTITIONS=4

Run the native Rust CLI via:

make diff-rust A=a.csv B=b.csv

Enable keyed mode by providing key columns:

make diff-rust A=a.csv B=b.csv KEY=id

Progress events can be emitted with:

make diff-rust A=a.csv B=b.csv KEY=id EMIT_PROGRESS=1

In partitioned mode, progress phases currently emit as: partitioning -> diff_partitions -> emit_events.

Rust CLI (Phase 3 complete)

The Rust CLI supports multiple output modes:

# default JSONL stream (event-per-line)
make diff-rust A=a.csv B=b.csv

# single JSON array output
make diff-rust A=a.csv B=b.csv FORMAT=json

# human-readable summary table
make diff-rust A=a.csv B=b.csv FORMAT=summary

# git-style terminal diff inspector
make diff-rust A=a.csv B=b.csv KEY=id FORMAT=diff

# write any mode to file
make diff-rust A=a.csv B=b.csv FORMAT=json OUT=/tmp/diff.json

FORMAT=diff is a terminal view meant for human readers only. It leaves the jsonl and json outputs unchanged and renders:

changed rows first
changed columns with inline A | B comparisons
field-level emphasis using existing changed / before / after / delta event data
inline substring markers for changed cell values when color is disabled

Set NO_COLOR=1 if you want plain-text diff output without ANSI styling.

For keyed mode in any format, include KEY=... or KEYS=....

Ignore column order alias:

make diff-rust A=a.csv B=b.csv KEY=id IGNORE_COLUMN_ORDER=1

Ignore row order (multiset positional):

make diff-rust A=a.csv B=b.csv IGNORE_ROW_ORDER=1

Web app (Phase 4 MVP complete)

diffly-web/ is now seeded from the DiffyData-style UX and wired to diffly runtime semantics:

runs comparison in a dedicated Web Worker (main thread stays responsive)
uses Rust/WASM path for small files
uses partitioned IndexedDB spill path for larger files (and falls back to in-memory worker mode if IndexedDB is unavailable)
supports cancel + phase progress frames in the UI
supports strategy selection: positional, ignore row order, compare by key
supports ignore-column-order toggle and key input only when keyed strategy is selected

Install and run:

make web-install
make web-lint
make web-dev

Type-check/build:

make web-typecheck
npm --prefix diffly-web run build

Build/update Rust WASM package for web:

make wasm-build-web

The Rust CLI uses the partitioned engine path by default (64 partitions). Override the partition count with:

make diff-rust A=a.csv B=b.csv KEY=id PARTITIONS=64

Force the legacy non-partitioned core path (for debugging/comparison):

make diff-rust A=a.csv B=b.csv KEY=id NO_PARTITIONS=1

CI checks

GitHub Actions runs the following on pull requests and pushes to main:

make test-spec
make test-python
python -m compileall diffly-python
a fixture-backed CLI smoke test via python diffly-python/diffly.py ...
Python smoke coverage for positional default, ignore-row-order, and ignore-column-order
Rust fmt check + cargo test + Rust fixture conformance (make test-spec-rust)
Rust engine conformance parity mode (make test-spec-rust-engine PARTITIONS=4)
Rust CLI smoke test via make diff-rust ...
Rust smoke coverage for positional default, ignore-row-order, and ignore-column-order
Rust partitioned CLI smoke test via make diff-rust ... PARTITIONS=4
Web app lint/typecheck/build (make web-lint + make web-typecheck + make web-build)

Project memory

To preserve execution context across sessions/agents:

docs/STATUS.md tracks current progress, blockers, and next steps.
docs/DECISIONS.md tracks active semantic/product decisions.
docs/HANDOFF.md provides a quick resume checklist.

Next steps:

Add OPFS/IndexedDB spill backend integration for browser large-file path.
Add browser-level large-file regression automation (100MB+ behavior checks).
Continue expanding fixture coverage for CSV edge cases.
