
feat: DumpMergeWriter + merge-aware compaction #148

Open
JustMaier wants to merge 107 commits into main from feat/dump-merge-writer

Conversation

@JustMaier
Contributor

Summary

  • DumpMergeWriter: Direct read-modify-write into data.bin during dump phases, bypassing ops log + compaction entirely. Uses 1024-bucket striped locks for rayon thread safety.
  • merge-aware compaction: Fixes correctness bug where cold/hot compact used LWW, silently dropping earlier Merge ops for the same key. DocSiloAdapter now sets a merge function that concatenates Mi arrays.
  • dump_processor wiring: Auto-detects whether DumpMergeWriter is available (data.bin exists from images phase) and uses it; falls back to ops log for the first phase.

How it works

  1. Images phase writes doc ops to ops log → compact_after_dumps creates data.bin + index.bin
  2. Tags/tools/techniques/resources phases: prepare_dump_merge() returns DumpMergeWriter
  3. Per-slot batched Mi values are written via merge_put() directly into data.bin (read existing → decode → concatenate Mi → re-encode → write in place)
  4. No ops log writes, no compaction step needed for subsequent phases

Files changed

| File | What |
| --- | --- |
| crates/datasilo/src/lib.rs | DumpMergeWriter struct, prepare_dump_merge, set_merge_fn, compact_cold_merge, hot compact merge |
| crates/datasilo/src/hash_index.rs | update_existing_concurrent (thread-safe entry update) |
| src/silos/doc_format.rs | merge_encoded_docs, merge_encoded_docs_into |
| src/silos/doc_silo_adapter.rs | prepare_dump_merge wrapper, set_merge_fn on open |
| src/sync/dump_processor.rs | DumpMergeWriter integration, collect_doc_op merge_writer param |

Tests

  • 3 DumpMergeWriter tests (basic, overflow, 1000-entry concurrent rayon)
  • 2 merge_encoded_docs tests (Mi concatenation, non-Mi replace)
  • 3 compaction tests (cold merge, hot merge, LWW backwards compat)

Test plan

  • Datasilo crate tests: 33 pass, 3 pre-existing key=0 sentinel failures (unrelated to this change)
  • All doc_format tests pass (8/8)
  • All doc_silo_adapter tests pass (2/2)
  • All hash_index tests pass (12/12)
  • Small-scale 14M dump validation (Scarlet running)
  • Full-scale 107M dump validation (pending small-scale results)

🤖 Generated with Claude Code

JustMaier and others added 30 commits April 3, 2026 15:02
…zations

Multi-phase dump correctness:
- DocOp::Merge variant: merges fields into existing docs instead of replacing
- All dump phases use Merge for object-level writes (fixes data loss bug)
- Tags post-pass: bitmap inversion writes one Merge per slot (4.5B→109M ops)
- 10 unit tests for Merge semantics (roundtrip, accumulate, delete+resurrect)

Pipeline performance (StreamingDocWriter fixes):
- BufWriter 256→8192 bytes on new shard creation (2x throughput improvement)
- Hardware CRC32 via crc32fast (replaces software byte-at-a-time table)
- Remove per-shard fsync in finalize (saves 20-80s per phase)
- Background enrichment drop (50s blocking → non-blocking)
- Mmap explicit drop after parse (zombie RSS 83GB→24GB)

DataSilo crate (crates/datasilo/):
- Generic mmap'd key-value store: 35M writes/sec, 23M reads/sec
- ParallelWriter with atomic bump + 1MB thread-local regions
- OpsLog with CRC32 append + replay on startup
- Compaction (replay ops → rewrite data file)
- 6 unit tests passing

Server endpoints:
- POST /time-buckets/rebuild: rebuild from sort field data + cache clear
- GET /dictionaries: reverse maps for LCS/MappedString fields
- GET /ui-config: serves YAML as JSON for config-driven UI

Config-driven UI (static/index.html):
- Dynamic filter/sort controls from engine metadata + YAML overrides
- Card rendering with image URL templates, badges, meta fields
- Detail modal with configurable fields, display types, formats
- URL state sync for bookmarkable/shareable filter states
- Civitai UI config (deploy/configs/civitai/ui-config.yaml)

Design docs:
- docs/design/docop-merge.md (GPT + Gemini reviewed)
- docs/design/datasilo-implementation-plan.md (full migration plan)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frozen query path (Task #33):
- BitmapSilo frozen accessors: get_frozen_filter(), get_frozen_sort_layer()
- mark_filters_backed() / mark_sorts_backed() — startup marks bitmaps as
  unloaded placeholders, reads from mmap at query time
- QueryExecutor: get_effective_bitmap() + and_effective_bitmap() helpers
  with frozen fallback for all filter ops (Eq, In, NotEq, NotIn, Or, Range)
- Sort traversal: bifurcate_frozen(), apply_cursor_filter_frozen(),
  reconstruct_value_frozen() — frozen layers from BitmapSilo mmap
- ConcurrentEngine holds BitmapSilo behind RwLock, passes to executor

Aggressive V2 retirement (~15K lines removed):
- Removed lazy loading: pending_filter_loads, pending_sort_loads,
  lazy_value_fields, ensure_fields_loaded(), LazyLoad enum
- Removed eviction: eviction_stamps, eviction_total, idle sweep
- Removed existence sets: existing_keys
- Deleted bitmap_memory_cache.rs, bitmap_fs.rs, bound_store.rs,
  doc_cache.rs, field_handler.rs, preset.rs, shard_store*.rs (4 files)
- Removed FilterField::load_field_complete(), load_values(),
  clear_bases_and_unload()
- Cleaned up 47 stale TODO comments (49→2)
- Deleted 8 dead test stubs, un-ignored 3 tests (0 ignored remaining)

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove dead SortField wrappers: bifurcate(), order_results(),
  apply_cursor_filter() — only frozen variants remain
- Remove dead FlushCommand fields: skip_lazy, cursors, dictionaries
- Remove dead docstore_root field from ConcurrentEngine
- Clean unused imports across datasilo, concurrent_engine, executor
- 635 tests passing, 0 failed, 0 ignored

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Arc<RoaringBitmap> on VersionedBitmap.base was for ArcSwap CoW
snapshot publishing. With V3 frozen mmap, published snapshots read
bases from BitmapSilo mmap, making the Arc unnecessary overhead.

- base: Arc<RoaringBitmap> → base: RoaringBitmap
- Removed from_arc() constructor
- Simplified merge(), or_into_base(), load_base() — direct mutation
- Updated all .base().as_ref() call sites to .base()
- diff: Arc<BitmapDiff> stays (still needed for swap_diff)

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporary get_shard/get_shard_packed shims on DocSiloAdapter to
unblock server compilation. These will be replaced with proper
get_document() API that reads from mmap + applies pending ops.

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Zero references to ShardStore, DocStoreV3, BitmapFs, doc_cache,
bound_store, or field_handler remain in the codebase.

- Removed DocCacheConfigEntry struct + doc_cache config field
- Removed 8 dead doc_cache metrics from metrics.rs
- Removed evict_doc_cache() + doc_cache_stats() stub methods
- Removed doc_cache metric scraping from server.rs
- Updated all comments from V2 system names to V3 (DataSilo/BitmapSilo)
- Updated test assertions from DocStoreV3 to DataSilo

635 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DocSink and Ingester<B> were V2 abstractions never used in production.
Keep BitmapSink trait, CoalescerSink, AccumSink (actively used).

631 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DataSilo::delete(key) appends Delete tombstone to ops log
- get_with_ops() respects delete tombstones (returns None)
- Cold compaction: deleted keys excluded from output data file
- Hot compaction: deleted keys have index entry zeroed out
- OpsLog::for_each_ops() yields full SiloOp (Put + Delete)
- Delete CRC validation in for_each()
- 4 new tests: cold delete, hot delete, get_with_ops delete, delete+reinsert

29 datasilo tests passing, 631 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge thread now only does:
1. DataSilo compaction when dirty (apply pending doc ops)
2. RSS-aware memory pressure eviction

Removed: unused inner clone, time_buckets capture, cursors capture,
suppress-unused hacks. Named the thread "bitdex-merge".

631 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Added compact_threshold to SiloConfig (default 0.20 = 20%)
- Added dead_bytes counter on DataSilo (AtomicU64)
- Hot compaction tracks dead bytes from deletes (zeroed index entries)
  and relocating updates (overflows where old slot becomes dead)
- Cold compaction resets dead_bytes to 0 (full rewrite)
- Added dead_bytes(), dead_ratio(), needs_compaction() accessors
- BitmapSilo uses compact_threshold=0.0 (bitmaps rewritten in full)

29 datasilo tests, 631 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cold compaction now uses mmap for both data and index writes:
1. Compute entry layouts sequentially (offsets are cumulative)
2. Pre-allocate data file at exact size and mmap it
3. Write entries via pointer copy to pre-computed offsets

Each entry targets a unique non-overlapping region, ready for
parallel writes (rayon) when needed. Currently sequential but
the infrastructure is in place — just change .for_each to
.par_iter().for_each() when rayon is added to datasilo.

29 datasilo tests, 631 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Deleted pg_sync/backfill.rs entirely (no external callers)
- Deleted pg_sync/csv_ops.rs entirely (no external callers)
- Removed apply_ops_batch_dump + process_wal_dump from ops_processor.rs
- Removed 7 dead parse_*_row functions from copy_queries.rs
  (kept parse_post_row, parse_model_version_row, parse_model_row)
- Removed associated dead types: CopyImageRow, CopyResourceRow, CopyMetricRow

631 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New src/cache_silo.rs:
- CacheEntryData: serializable subset of UnifiedEntry (bitmap, metadata, sorted_keys)
- Binary format v1: fixed header + variable bitmap + optional sorted_keys
- hash_unified_key(): folds 64-bit hash to u32 for DataSilo key
- save_entry/delete_entry: append to ops log
- load_all: scan ops log + data file for last-write-wins restore
- compact: delegates to DataSilo compaction

Wiring in ConcurrentEngine:
- cache_silo field (Arc<RwLock<CacheSilo>>)
- Startup: open + load_all from bitmap_path/cache_silo/
- Flush thread: drain_dirty_for_silo() → save dirty entries after cache maintenance
- Merge thread: compact CacheSilo when dead space exceeds threshold

UnifiedCache additions:
- drain_dirty_for_silo(): collects dirty entries as (key_hash, CacheEntryData)

8 new CacheSilo tests, 639 total tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CacheSilo restore fix:
- Added UnifiedKey serialization to CacheEntryData binary format (v2)
- Added key field to CacheEntryData (encode/decode round-trips the key)
- Wired actual restore path: load_all → from_cache_entry_data → insert_restored_entry
- Added UnifiedEntry::from_cache_entry_data() constructor
- begin_restore/finish_restore for batch eviction

Dead enrichment code removal:
- Removed PostEnrichment, MvEnrichment, ModelEnrichment structs
- Removed load_posts_enrichment, load_mv_enrichment, load_model_enrichment
- Removed CopyPostRow, CopyModelVersionRow, CopyModelRow + parse functions
- Removed dead helper functions (is_null, parse_opt_*, parse_bool, parse_i64_fast)

639 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
compact_hot() previously truncated the single ops log after compaction,
losing any ops written during the compaction window. Fix: two ops log
slots (ops_a.log, ops_b.log) with atomic swap.

Protocol:
1. Freeze active slot, redirect writes to other slot (atomic xor)
2. Compact data from frozen slot
3. Truncate frozen slot only after data+index fully flushed

Legacy migration: existing ops.log renamed to ops_a.log on first open.

Tests: test_ab_swap_no_ops_lost, test_legacy_ops_log_migration.
31 datasilo tests, 639 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed in compact_hot_from():

1. Reader blocking: old code dropped self.data_mmap during compaction,
   causing get() to return None. Fix: write to data.bin.tmp while old
   mmap stays alive, then rename over data.bin.

2. Data/index interleaving: old code wrote data AND updated index in
   same loop body. Crash mid-loop = corrupt state. Fix: three strict
   phases — classify (read-only), write data (tmp file), update index
   (only after data flushed).

Dead-space accounting also fixed: captures old_allocated during the
read-only classification pass before any mutations.

Tests: test_hot_compact_does_not_drop_read_mmap_early,
       test_hot_compact_data_before_index_sequential_rounds

33 datasilo tests, 639 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BitmapSilo compact_threshold: 0.0 → 0.20 (20% dead space triggers)
- Added compact() and needs_compaction() to BitmapSilo
- Merge thread now round-robins across doc, cache, and bitmap silos
- bitmap_silo_arc created early for sharing with merge thread

639 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scarlet audit: previous hot compaction copied the ENTIRE data file to a
temp file on every cycle — 25GB memcpy at 107M docs.

Fix: two-tier approach:
- In-place updates: seek+write to existing data.bin at allocated offsets
- Overflows: append to end of existing data.bin (old slot = dead space)
- Full file rewrite only when dead_ratio > compact_threshold (separate pass)
- Never copy the entire file for routine compaction

No temp file, no rename. data_mmap remaps only when file grows (overflows).
In-place path doesn't touch the mmap at all — readers unblocked throughout.

33 datasilo tests, 639 lib tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BitmapSilo true silo (Phase 2 foundation):
- Ops encoding: OP_SET_BIT (0x01) and OP_CLEAR_BIT (0x02) for individual
  bit mutations, alongside existing full frozen bitmap format
- Mutation methods: filter_set/clear, sort_set/clear, alive_set/clear
  — append 5-byte ops to silo ops log
- Ops-on-read: get_filter_with_ops, get_sort_layer_with_ops, get_alive_with_ops
  — read frozen base + scan ops for pending set/clear, apply inline
- DataSilo.scan_ops_for_key() — scan both A-B logs for all ops on a key

Dead stubs cleanup (Phase 5 partial):
- Deleted memory_pressure.rs + all references
- Deleted get_rss_bytes() + Windows/Linux FFI from concurrent_engine
- Deleted dead stubs: boundstore_*, preload_*, build_all_from_docstore,
  rebuild_fields_from_docstore, add_fields_from_docstore, etc.
- Merge thread: removed RSS eviction loop (no heap data to evict)
- Removed rebuild_on_boot from server.rs

636 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
get_effective_bitmap now reads from BitmapSilo first (frozen base +
pending silo ops), then merges with in-memory VersionedBitmap diffs
for mutations not yet written to the silo. During the Phase 2→4
transition both sources may have data; union combines them.

and_effective_bitmap simplified to delegate to get_effective_bitmap.

636 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added send_mutation_ops() helper that dual-writes every MutationOp
to both the BitmapSilo ops log (V3 path) and the coalescer channel
(V2 path, removed in Phase 4). All 6 mutation entry points wired.

Filter, sort, AND alive mutations all go to the silo ops log:
- FilterInsert/Remove → silo.filter_set/clear per slot
- SortSet/Clear → silo.sort_set/clear per slot
- AliveInsert/Remove → silo.alive_set/clear per slot

Combined with the executor ops-on-read from the previous commit,
this means the silo now has complete mutation data AND reads apply
it. The coalescer/ArcSwap path is now redundant (Phase 4 removes it).

636 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Query path now checks CacheSilo before UnifiedCache:
1. Hash UnifiedKey → key_hash
2. If not in UnifiedCache, try cache_silo.get_entry(key_hash)
3. On silo hit: promote to UnifiedCache via from_cache_entry_data
4. Downstream logic (sorted_keys, radix, bucket diffs) works unchanged

New: CacheSilo.get_entry(key_hash) — single-key read via get_with_ops
New: silo_hits metric for tracking cross-restart cache effectiveness
Write path unchanged: flush thread drain_dirty_for_silo still handles persistence

4 new get_entry tests. 640 total tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- BitmapSilo: RwLock on name_to_key for concurrent key auto-creation
  (new bitmap values auto-assign silo keys instead of silently skipping)
- send_mutation_ops(): skip coalescer when bitmap_silo exists
  (mutations go ONLY to silo ops log for engines with a silo)
- get_effective_bitmap(): simplified to silo-first, VB fallback for tests
- Removed V2 lazy-load test (tested flush thread mechanics, N/A with silo)

Phase 4 foundation: with silo-only mutations, the coalescer path is
now dead for production engines. Tests without silos still use it.

640 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MutationOp and MutationSender now live in mutation.rs (their natural
home) instead of write_coalescer.rs. Updated all imports across
concurrent_engine, ingester, ops_processor. write_coalescer.rs now
imports from mutation.rs — preparation for deleting the coalescer.

640 tests passing, 0 failed, 0 ignored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WriteCoalescer batching system replaced by direct silo ops log writes.
Flush thread uses local FlushBatch struct for remaining staging updates.
MutationOp + MutationSender already moved to mutation.rs.
FilterGroupKey moved to unified_cache.rs.

615 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UnifiedCache replaced entirely by CacheSilo:
- Query path reads cache via CacheSilo.get_entry() only
- No in-memory HashMap, no radix sort index, no LRU tracking
- UnifiedKey moved to cache_silo.rs
- Flush thread live maintenance removed (~1,800 lines from concurrent_engine)
- Prefetch worker removed
- Cache stats/metrics simplified to CacheSilo-only

Total removed this commit: ~5,200 lines (unified_cache.rs + flush thread code)
561 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FlushCommand (ForcePublish, SyncUnloaded, ExitLoadingSaveUnload) and
the cmd_tx/cmd_rx command channel are gone. loading_mode AtomicBool
and all enter/exit methods removed.

- enter_loading_mode() / exit_loading_mode() → no-ops
- exit_loading_mode_and_save_unload() → just calls save_snapshot()
- save_and_unload() → calls publish_staging directly
- Flush thread simplified: no command handling, no loading mode checks
- 2 V2 tests deleted (loading mode timing tests)

559 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ArcSwap<InnerEngine> replaced with direct RwLock fields:
- slots: Arc<RwLock<SlotAllocator>>
- filters: Arc<RwLock<FilterIndex>>
- sorts: Arc<RwLock<SortIndex>>

Queries hold read locks. Flush thread holds write locks for mutation
application only. No more staging clone, no snapshot publishing.

Bulk-load paths (clone_staging/publish_staging) still work via
read-lock clone → offline build → write-lock swap.

559 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes the following methods that are no longer part of the engine API:
- put_via_wal, patch_document_via_wal (WAL write path — superseded by ops pipeline)
- put_inner (inlined into put())
- patch, patch_document (PATCH semantics — use PUT for all writes)
- sync_filter_values (filter_only sync — use PUT for all writes)
- put_many, put_bulk, put_bulk_loading, put_bulk_into (bulk loading)
- spawn_docstore_writer, write_docs_to_docstore (docstore helpers)
- apply_accum (BitmapAccum apply — superseded by apply_bitmap_maps)
- wal_writer field and set_wal_writer (WAL path removed)

Keeps: put(), delete(), clone_staging(), publish_staging(), apply_bitmap_maps()
— these are still used by dump_processor, loader, and remove_fields.

Server PATCH and filter_sync handlers now return 501 Not Implemented.
Removes 7 tests that covered the deleted methods.
Benchmark "bulk" stage replaced with a no-op placeholder.

cargo check --lib: 0 errors
cargo check --features server,pg-sync: 0 errors
cargo test --lib: 548 passed, 0 failed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Converted concurrent_engine.rs to directory module. Extracted 6 query
methods to src/concurrent_engine/query.rs:
- query(), execute_query(), execute_query_impl()
- execute_query_traced(), execute_query_with_collector()
- resolve_filters(), post_validate()

concurrent_engine/mod.rs is now the engine struct + construction + mutations.
concurrent_engine/query.rs is the query execution path.

548 tests passing, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JustMaier and others added 24 commits April 4, 2026 21:48
…sent

When has_silo is true, the flush thread skips acquiring write locks on
FilterIndex/SortIndex/SlotAllocator and applying mutations — those go
directly to BitmapSilo instead. Also removes the dead stale_fields
collection (collected then immediately cleared with no reads between)
and the flush_cycle counter + COMPACTION_INTERVAL constant which were
only used for dead compaction logic.

Shutdown/final-drain path mirrors: apply only when !has_silo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
refactor: skip in-memory bitmap apply when BitmapSilo present
…lo (#140)

feat: add FieldRegistry + deterministic u64 key encoding for BitmapSilo
…ewrite save_snapshot to use live silo

- Remove pending_bucket_diffs field from ConcurrentEngine — it was only
  passed to the flush thread on startup and never read back. The flush
  thread still owns and updates the ArcSwap<PendingBucketDiffs> locally.

- Rewrite save_snapshot to use the live BitmapSilo when present. Instead
  of reading all in-memory bitmaps and calling save_all_parallel (which
  re-serialized everything), the ops-on-read path only needs to flush
  metadata (slot_counter, cursors) and compact the ops log. Added
  BitmapSilo::save_meta() for this purpose.

- save_and_unload now calls save_snapshot first, ensuring metadata is
  persisted before the in-memory state is reset.

- Mark BitmapSilo::save_all as #[cfg(test)] — it's only used in test
  fixtures to create frozen silo snapshots. Production path is write_dump_maps
  or the ops-on-read individual set/clear ops.

- Delete snapshot_public from ConcurrentEngine — was only used by the
  benchmark to get a sort field handle. Replace with reconstruct_sort_value
  public method on ConcurrentEngine.

- Fix benchmark.rs import path: bitdex_v2::concurrent_engine ->
  bitdex_v2::engine.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ap_maps

Delete InnerEngine struct, clone_staging(), publish_staging(), and
apply_bitmap_maps() — the staging clone-then-swap pattern required
cloning all three live indexes (filters, sorts, slots) just to build
bitmaps offline, adding memory pressure during bulk loads.

Replace with engine.merge_bitmap_maps(filter_maps, sort_maps, alive)
which applies pre-built bitmaps directly to the live state under brief
write locks. This is equivalent to the old staging pattern but without
the full InnerEngine clone.

Update loader.rs Stage 3 to use merge_bitmap_maps directly instead of
accumulating into a staging snapshot then publishing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove cache_silo from FlushArgs — the flush thread no longer does
  any cache maintenance (that moved to the CacheSilo/ops-on-read path).
  CacheSilo is still owned by ConcurrentEngine and used by the query path.

- Remove flush_cycle counter and COMPACTION_INTERVAL constant from the
  flush thread — filter diff compaction was removed in a prior session
  and these variables were just incrementing with no effect.

- Remove has_alive_mutations() and mutated_filter_fields() from FlushBatch
  — neither was called anywhere in the codebase.

- Fix save_snapshot fallback: when no live silo exists (engine started
  fresh with no prior snapshot), fall back to save_all_parallel to
  serialize the in-memory state to a new silo file. This restores the
  test_save_snapshot_* test behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Delete copy_field_arc_from() from both FilterIndex and SortIndex —
these were used by the old save_and_unload path when preserving field
Arcs across the staging swap. No callers remain in the codebase.

Delete total_bitmap_count() from FilterIndex — never called outside
the dead snapshot V2 code path.

Suppress dead_code warning on ConcurrentEngine::doc_tx — the field is
a channel sender used by the test put() helper to drive docstore writes
through the flush thread, so it is a real field. The linter can't see
the test usage due to cfg(test) scoping.

Also remove the stale ConcurrentEngine ArcSwap doc-comment banner —
we no longer publish immutable snapshots via ArcSwap (removed in
prior sessions).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
refactor: delete V2 in-memory infrastructure (InnerEngine, staging, dead methods)
…poch gaps)

Fix 1: alive_count() and reconstruct_sort_value() in ConcurrentEngine now read
from BitmapSilo when present instead of stale in-memory SlotAllocator/SortIndex.

Fix 2: bump_field_epochs() now tracks AliveInsert/AliveRemove under the
"__alive__" field key. Cache seeding in query.rs now includes the __alive__
epoch so inserts/deletes correctly invalidate cached results.

Fix 3: QueryExecutor::alive_count() now derives from alive_bitmap() via the
OnceCell, ensuring per-query consistency with the cached alive set.

Fix 4: merge_bitmap_maps() now routes writes to the silo via write_dump_maps()
when a BitmapSilo is present (skipping stale in-memory writes), and falls back
to the in-memory path only when no silo is configured.

Fix 5: Flush thread time bucket maintenance now uses frozen_reconstruct_value()
from the silo when has_silo is true, instead of reading from the in-memory
SortIndex that is not updated on the silo path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: 5 correctness bugs from external review (stale reads, cache epochs, silo bypass)
…Silo

Part 1 — DataSilo u32 → u64 key upgrade:
- ops log frame format: [tag:1][key:4][len:4][value][crc32:4] →
  [tag:1][key:8][len:4][value][crc32:4] (breaking format change)
- Replace flat array index with HashIndex (open-addressed mmap hash table)
  that already uses u64 keys — eliminates the 2^32 slot limit
- All public API signatures updated: append_op, append_ops_batch, delete,
  get, get_with_ops, write_batch_parallel, scan_ops_for_key
- compact_cold_from / compact_hot_from rewritten for HashIndex
- CacheSilo: hash_unified_key returns u64 (no fold), load_all uses
  iter_index_keys instead of 0..index_cap iteration
- DocSiloAdapter: slot as u64 casts at call sites

Part 2 — FieldRegistry + deterministic u64 keys in BitmapSilo:
- Add field_registry: parking_lot::Mutex<FieldRegistry> to BitmapSilo
- Load/create FieldRegistry in open(); save alongside manifest
- Replace format!("filter:{}:{}") + ensure_key() with
  ensure_field_id() + encode_filter_key(field_id, value) in all
  mutation methods (filter_set/clear, sort_set/clear, save_all,
  save_all_parallel, write_dump_maps, ParallelBitmapWriter)
- Frozen accessors (get_frozen_filter, get_frozen_sort_layer) and
  ops-on-read (get_filter_with_ops, get_sort_layer_with_ops) now use
  FieldRegistry encoding for point lookups
- Enumeration paths (load_filters, load_sorts, filter_entries,
  filter_values_for_field, mark_filters_backed, mark_sorts_backed)
  retain legacy name_to_key manifest — populated by all write paths
  to maintain backward-compat enumeration
- All 450 lib tests pass, zero warnings in main crate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: DataSilo u64 keys + FieldRegistry wiring into BitmapSilo
…Silo

Cherry-pick of 4f8d4da from feat/datasilo-u64-wiring (PR #144).

Part 1 — DataSilo u32 → u64 key upgrade:
- ops log frame format: [tag:1][key:4][len:4][value][crc32:4] →
  [tag:1][key:8][len:4][value][crc32:4]
- Replace flat array index with HashIndex (open-addressed mmap hash table)
  that already uses u64 keys — eliminates the 2^32 slot limit
- All public API signatures updated: append_op, append_ops_batch, delete,
  get, get_with_ops, write_batch_parallel, scan_ops_for_key
- CacheSilo: hash_unified_key returns u64 (no fold), load_all uses
  iter_index_keys
- DocSiloAdapter: slot as u64 casts at call sites

Part 2 — FieldRegistry + deterministic u64 keys in BitmapSilo:
- Add field_registry: parking_lot::Mutex<FieldRegistry> to BitmapSilo
- Load/create FieldRegistry in open(); save alongside manifest
- Replace format!("filter:{}:{}") + ensure_key() with
  ensure_field_id() + encode_filter_key(field_id, value) in all
  mutation methods (filter_set/clear, sort_set/clear, save_all,
  save_all_parallel, write_dump_maps, ParallelBitmapWriter)
- Enumeration paths retain legacy name_to_key manifest (populated by
  all write paths) for backward-compat enumeration
- Fix dead_code warning on REGION_SIZE (test-only constant)

Dump processor flow verified: dump builds bitmaps → passes String field
names to write_dump_maps → write_dump_maps calls ensure_field_id()
(FieldRegistry.ensure()) to get u16 field_id → encode_filter_key(field_id,
value) / encode_sort_key(field_id, bit_idx) → u64 silo key → DataSilo.
No changes needed in dump_processor.rs or loader.rs — the API surface
(HashMap<String, ...> inputs) is unchanged; key encoding moved inside
write_dump_maps.

cargo check --lib: clean (0 errors, 0 warnings)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: DataSilo u64 keys + FieldRegistry wiring + dump processor V3 verified
DataSilo's write_put, write_put_reuse, and append_ops_batch APIs were
upgraded to u64 keys, but dump_processor.rs still passed u32 slot values.
Fixes all 8 sites: ThreadResult type alias, doc_ops/all_doc_ops/mv_ops
Vec types, write_put_reuse/write_put call sites, and test Vec declarations.

Resolves all 5 E0308 compile errors under `--features "server,pg-sync"`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: u32→u64 casts for DataSilo callers in dump_processor (server+pg-sync)
- filter_field_to_idx and filter_tuples use u16 instead of u8,
  removing the silent 255-field cap
- serialize_frozen_into errors now logged instead of silently discarded

From FOLLOWUP.md review items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_entries=0

- bitdex.default.toml: rayon_threads = 24 (optimal for 16-core, avoids HT contention)
- server.rs: reads rayon_threads from TOML, sets RAYON_NUM_THREADS env var at startup
- query.rs: max_entries=0 or max_bytes=0 disables caching entirely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-value fields (tagIds, toolIds, techniqueIds) now use the same
per-row Merge doc op path as standard fields. Each row emits a
single-element Mi([value]) array; compaction concatenates arrays
for the same slot via the existing Mi merge semantics.

Eliminates:
- is_multi_value_only detection (hardcoded field name list)
- Bitmap inversion post-pass (~50 lines) that rebuilt per-slot arrays
  from the merged filter bitmaps after parse
- Separate parallel/sequential write paths for multi-value doc ops

Adds:
- DumpFieldValue::MultiInt variant + write_field_multi_int encoding
- DocValueType::MultiInt in the compiled doc field plan
- multi_value_fields HashSet built from config FilterFieldType::MultiValue

450 tests pass. Net -35 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…phase

The unified per-row Merge doc op path (f842a72) caused 4.5B doc ops for
the tags phase, hitting 70GB RSS in 25 seconds. Multi-value-only phases
(tags, tools, techniques) must skip per-row doc ops and use the bitmap
inversion post-pass instead (~108M ops vs 4.5B).

Detection now uses config-driven multi_value_fields HashSet instead of
the old hardcoded field name list. The DumpFieldValue::MultiInt variant
and DocValueType::MultiInt are kept for future use when per-slot
accumulation during parse becomes feasible.

450 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…post-pass

All dump phases now stream doc ops directly to the DocSilo mmap via
ParallelOpsWriter during parse. No in-memory Vec accumulation. Multi-value
fields emit Mi([value]) per row; compaction concatenates per slot.

The is_multi_value_only flag and bitmap inversion post-pass are fully
removed. This eliminates ~50 lines of special-case code.

Fix for prior OOM: the parallel ops writer mmap size estimate now uses
40 bytes/row for single-field multi-value phases (tags: ~30 bytes actual)
vs 400 bytes/row for standard phases. This prevents over-allocation
while ensuring the mmap is large enough for streaming writes.

450 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug 1: Slot ID 0 mapped to DataSilo key 0, which is the HashIndex empty
sentinel. Fix: slot_to_key(slot) = slot + 1. Applied in DocSiloAdapter
and all dump_processor parallel write sites.

Bug 2: Parallel ops writer mmap size was estimated as body.len()/100 rows,
but tags CSV has ~14 byte lines (not 100), so 63GB/100 = 630M estimated
vs 4.5B actual. Fix: sample first 4KB to compute actual average line
length. Tags now estimates ~4.5B rows × 40 bytes = ~180GB.

450 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
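The sentinel-offset fix for Bug 1 can be sketched as below (SLOT_KEY_OFFSET and the function names are assumptions taken from the commit messages, not the actual API):

```rust
/// Key 0 is the HashIndex empty sentinel, so slot IDs are shifted by one
/// before being used as DataSilo keys. Sketch only; names are illustrative.
const SLOT_KEY_OFFSET: u64 = 1;

fn slot_to_key(slot: u32) -> u64 {
    slot as u64 + SLOT_KEY_OFFSET
}

fn key_to_slot(key: u64) -> u32 {
    (key - SLOT_KEY_OFFSET) as u32
}
```

With this mapping, slot 0 maps to key 1 and never collides with the empty sentinel.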
Multi-value CSV rows (tags, tools, techniques) are sorted by imageId.
Instead of writing one Merge(Mi([tagId])) per row (4.5B ops for tags),
accumulate values while the slot stays the same and flush one
Merge(Mi([all_tags])) when the slot changes.

4.5B per-row ops → ~109M per-slot ops. Mmap estimate drops from
~180GB to ~5GB. Compaction RSS from ~104GB to manageable.

Standard phases (images) are unaffected — they use the direct
per-row path since each row has a unique slot.

450 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
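The accumulate-and-flush loop described above can be sketched as follows; `emit` stands in for the real `Merge(Mi([...]))` write, and the row/value types are simplified assumptions:

```rust
/// Accumulate multi-value rows (pre-sorted by slot) and flush a single
/// merged op per slot instead of one op per row. Sketch only.
fn accumulate_per_slot(rows: &[(u32, i64)], mut emit: impl FnMut(u32, Vec<i64>)) {
    let mut current: Option<(u32, Vec<i64>)> = None;
    for &(slot, value) in rows {
        if let Some((s, vals)) = current.as_mut() {
            if *s == slot {
                vals.push(value); // same slot: keep accumulating
                continue;
            }
        }
        // slot changed (or first row): flush the accumulated values
        if let Some((s, vals)) = current.take() {
            emit(s, vals);
        }
        current = Some((slot, vec![value]));
    }
    if let Some((s, vals)) = current.take() {
        emit(s, vals); // flush the final slot
    }
}
```

This relies on the CSV being sorted by imageId; an unsorted input would emit multiple ops for the same slot (still correct under Mi merge semantics, just less compact).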
…action

Eliminates the ops log + compaction bottleneck for doc writes during dump.
After the images phase creates data.bin via write_batch_parallel, subsequent
phases (tags, tools, techniques, resources) use DumpMergeWriter to read
existing doc records, merge Mi arrays in-place, and write back — no ops log,
no cold compact needed.

Also fixes a correctness bug where compaction used last-write-wins for
duplicate keys, silently dropping earlier Merge ops. DocSiloAdapter now
sets a merge function that concatenates Mi fields during compaction.

Changes:
- DumpMergeWriter: striped-lock (1024 buckets) concurrent read-modify-write
- HashIndex::update_existing_concurrent: thread-safe entry update
- merge_encoded_docs: decode two doc records, fuse Mi arrays, re-encode
- DataSilo::set_merge_fn: optional merge callback for compaction
- compact_cold_merge: merge-aware cold compaction path
- compact_hot_from: merge duplicate ops + merge with existing data file
- dump_processor: auto-detects DumpMergeWriter availability per phase

11 new tests (3 DumpMergeWriter, 2 merge_encoded_docs, 3 compaction, 3 compat)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JustMaier left a comment

PR #148 Review: DumpMergeWriter + merge-aware compaction

Overall this is well-structured -- the dump merge writer is a sound optimization (bypass ops log for subsequent phases), and the merge-aware compaction fixes a real correctness bug. The test coverage is solid for the happy paths. I found several issues that need attention before merge, ranging from potential data corruption to silent data loss.


1. Critical: Read/write mmap aliasing on the same file (UB risk)

prepare_dump_merge creates a second MmapMut (the write mmap) on the same data.bin file that self.data_mmap (the read mmap) already maps. The DumpMergeWriter then reads from read_ptr (the DataSilo read-only mmap) and writes to write_ptr (the new writable mmap).

Two mmaps on the same file from the same process is not undefined behavior per se, but the OS makes no guarantees about visibility timing between them. On Linux, both mappings typically share the same page cache, so writes through write_ptr are visible through read_ptr immediately. On Windows, this is less guaranteed -- separate MapViewOfFile calls may or may not share pages.

This matters because merge_put reads existing data from read_ptr, merges, and writes to write_ptr. If a prior merge_put wrote to the same slot (which should not happen in normal dump flow but could happen with retry logic or duplicate keys in CSV), the read would see stale data from the read mmap.

Recommendation: Document the single-write-per-key invariant explicitly in the safety contract. Consider whether the read path should use write_ptr instead of read_ptr to guarantee read-your-own-writes, which would eliminate the aliasing concern entirely -- the read mmap would only be needed for its length.

2. High: update_existing_concurrent writes partial fields without atomicity

The update_existing_concurrent method writes offset, length, and allocated as three separate copy_nonoverlapping calls. If a concurrent reader (e.g., a get() call from another thread) reads the entry between writes, it could see an inconsistent state -- new length with old offset, or vice versa.

During dump phases this is likely safe because queries are not running, but this method is pub and its safety contract only says "no two threads call this with the same key simultaneously" -- it says nothing about concurrent readers. A get() on a key being updated could observe torn reads.

Recommendation: Either (a) write all three fields in a single copy_nonoverlapping of 16 bytes (offset+length+allocated are contiguous at +8..+24), or (b) document in the safety contract that concurrent reads of the same key are also unsafe.
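Option (a) can be sketched as below. The +8..+24 field layout and widths (u64 offset, u32 length, u32 allocated) are assumptions inferred from this review comment, and a 16-byte copy_nonoverlapping is not formally atomic — this only narrows the torn-read window to a single copy:

```rust
/// Write offset+length+allocated as one 16-byte copy so a concurrent reader
/// never observes a half-updated entry spread across three separate writes.
/// Layout assumed: entry+8 = offset (u64), +16 = length (u32), +20 = allocated (u32).
unsafe fn write_entry_fields(entry: *mut u8, offset: u64, length: u32, allocated: u32) {
    let mut buf = [0u8; 16];
    buf[0..8].copy_from_slice(&offset.to_le_bytes());
    buf[8..12].copy_from_slice(&length.to_le_bytes());
    buf[12..16].copy_from_slice(&allocated.to_le_bytes());
    // Single copy instead of three: readers see either all old or all new
    // bytes in practice, though this is not a Rust-memory-model guarantee.
    std::ptr::copy_nonoverlapping(buf.as_ptr(), entry.add(8), 16);
}
```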

3. Medium: merge_encoded_docs error silently falls back to new.to_vec()

Throughout dump_processor.rs, the merge callback is:

|existing, new| {
    merge_encoded_docs(existing, new)
        .unwrap_or_else(|_| new.to_vec())
}

If decoding fails (corrupted existing data, truncated write, etc.), this silently drops the existing doc and replaces it with just the new fragment. For a tags phase, this means the base image doc fields (userId, sortAt, etc.) would be silently lost and replaced with just Mi([tagId]).

Same pattern in doc_silo_adapter.rs line ~798 for the compaction merge function.

Recommendation: At minimum, log the error with the key so it is diagnosable. Better: count decode failures in an atomic counter on DumpMergeWriter (like overflow_count) so the post-phase summary reports them.
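A counting variant of the callback might look like this; `try_merge` stands in for merge_encoded_docs, and the counter mirrors the suggested atomic on DumpMergeWriter (names are illustrative, not the actual API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Merge callback that records decode failures instead of silently
/// replacing the existing record with only the new fragment.
fn merge_with_counter(
    existing: &[u8],
    new: &[u8],
    try_merge: impl Fn(&[u8], &[u8]) -> Result<Vec<u8>, String>,
    decode_errors: &AtomicU64,
) -> Vec<u8> {
    match try_merge(existing, new) {
        Ok(merged) => merged,
        Err(_) => {
            // Fallback still loses the existing doc, but the failure is
            // now visible in the post-phase stats instead of vanishing.
            decode_errors.fetch_add(1, Ordering::Relaxed);
            new.to_vec()
        }
    }
}
```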

4. Medium: Hot compaction merge order -- correct but needs a comment

In compact_hot_from (around line 1024-1033), after merging ops together, the code merges with existing data as:

*new_value = merge_fn(existing_bytes, new_value);

This calls merge_fn(existing_from_data_file, merged_ops_value). In merge_encoded_docs, new_data fields replace existing fields (except Mi which concatenates). So the argument order matters: the "new" ops should be in the new_data position, and the data file entry should be in the existing position. This is correct.

However, the ops-internal merge also uses the same argument convention: first op is "existing", second op is "new". If both ops set the same non-Mi field, the second op wins (correct LWW within an ops batch). Then the full merge treats the data file as "existing" and the ops-merged value as "new", so ops correctly override data file values. This is correct but subtle -- a comment clarifying the argument order semantics in compact_hot_from would prevent future mistakes.
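The ordering argument can be demonstrated with plain Vec concatenation standing in for Mi-field merging (a simplification — real merge_encoded_docs also does per-field LWW for non-Mi fields):

```rust
/// "existing" values come first, "new" values append after — the Mi
/// concatenation convention described above.
fn merge(existing: &[i64], new: &[i64]) -> Vec<i64> {
    existing.iter().chain(new.iter()).copied().collect()
}

/// Hot-compact ordering: fold ops left-to-right (earlier op in the
/// "existing" position), then put the data-file entry in the "existing"
/// position so all ops land after it chronologically.
fn compact_value(data_file: &[i64], ops: &[Vec<i64>]) -> Vec<i64> {
    let merged_ops = ops.iter().fold(Vec::new(), |acc, op| merge(&acc, op));
    merge(data_file, &merged_ops)
}
```

The result is [data_file_values, op1_values, op2_values, ...], matching the "base+add1+add2" ordering the tests assert.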

5. Medium: compact_cold_merge calls write_batch_parallel which replaces the entire data file

compact_cold_merge collects all merged ops into a HashMap, then calls write_batch_parallel to write them. write_batch_parallel creates a new data.bin and index.bin from scratch. This is correct for cold compaction (no existing data file), but if called when there IS an existing data file (which should not happen because compact() routes to compact_hot_from when data exists), it would overwrite the data file with only the ops entries, losing all other keys.

This is protected by the routing logic in compact(), but it is a latent footgun. Consider adding a debug assertion in compact_cold_merge that self.data_mmap.is_none().

6. Low: flush() is a no-op

DumpMergeWriter::flush() returns Ok(()) with a comment saying "the mmap will flush on drop." This is misleading -- callers might expect flush() to actually persist data. The MmapMut is stored in _write_mmap, but flush() works only through the raw write pointer and never calls the mmap's own flush().

Recommendation: Either make flush() actually call self._write_mmap.flush() (change _write_mmap to not be prefixed with _ and access it directly), or remove the method and document that data is persisted on drop.

7. Low: Linear scan in merge_encoded_docs field lookup

if let Some(entry) = fields.iter_mut().find(|(f, _)| *f == field_idx) {

This is O(n) per field in the new record, making the merge O(n*m) where n = existing fields and m = new fields. For doc records with ~15 fields this is negligible. Fine as-is, just noting it.

8. Nit: merge_encoded_docs_into is declared but never called

The zero-allocation variant merge_encoded_docs_into that writes into a caller-provided buffer is defined but all call sites use merge_encoded_docs (which allocates). Was this intended for future use from the DumpMergeWriter hot path?


Summary

The architecture is sound -- DumpMergeWriter for direct in-place writes during dump phases is a clear win over the ops log + compaction path. The merge-aware compaction fix is necessary for correctness. Main concerns:

  1. Use write_ptr for reads too (or document the aliasing invariant)
  2. Single 16-byte write in update_existing_concurrent (torn read risk)
  3. Do not silently swallow decode errors in the merge callback
  4. flush() should actually flush or be removed

Tests are thorough for the new code paths. The fallback to ops log when DumpMergeWriter is None looks correct -- all existing collect_doc_op call sites pass None for the new parameter in the non-merge path.

@JustMaier left a comment

Code Review: DumpMergeWriter + merge-aware compaction

Good structural approach — bypassing the ops log for subsequent dump phases is the right move for throughput, and the striped-lock design is straightforward. Several issues below, ranging from a potential soundness problem to minor nits.


1. Dual-mmap aliasing in DumpMergeWriter (soundness concern)

prepare_dump_merge() opens a second MmapMut on the same data.bin file that the DataSilo already holds a read-only Mmap for (self.data_mmap). The DumpMergeWriter then:

  • Reads from read_ptr (the DataSilo's existing read mmap)
  • Writes to write_ptr (the new writable mmap)

Both point to the same file. On Linux, the kernel maps the same physical pages for both, so writes through write_ptr are visible through read_ptr — this works. On Windows, behavior is less guaranteed with independent mmap handles on the same file.

More importantly: merge_put writes via write_ptr and then reads via read_ptr with no synchronization tying the two mappings together. In practice the stripe lock serializes same-key access (a key always hashes to the same bucket), and different-key accesses touch disjoint offsets, so this is likely safe — but two separate mmaps over the same file sit outside any guarantee Rust's memory model provides, so soundness rests entirely on OS page-cache behavior.

Suggestion: Consider using a single MmapMut and reading from the same pointer (cast to *const u8 for reads). This eliminates the aliasing question entirely. The read mmap in DataSilo isn't needed during dump since no queries run concurrently.


2. merge_encoded_docs slot extraction assumes Merge op layout

let slot = if existing.len() >= 5 {
    u32::from_le_bytes(existing[1..5].try_into().unwrap())
} else {
    0
};

This hard-codes that byte 0 is the op tag and bytes 1..5 are the slot. It works for OP_TAG_MERGE records, but:

  • No validation of byte 0: If existing is a Create op or corrupted data, slot extraction silently returns a wrong value.
  • Fallback to slot=0: If existing.len() < 5, slot becomes 0 — but slot_to_key maps slot 0 to key 1 (via SLOT_KEY_OFFSET), so this wouldn't cause a key=0 sentinel collision. However, a record claiming to be slot 0 when the real slot is something else is still a silent data corruption.

Suggestion: Either assert/validate that existing[0] == OP_TAG_MERGE, or extract slot from the decode_doc_fields path (which already parses the op). The perf cost of one extra byte check is negligible.


3. merge_encoded_docs_into allocates despite the name

The doc comment says "Zero allocation except for the field Vec decode" — but decode_doc_fields calls decode_doc_op which allocates Vec<(u16, PackedValue)> and potentially allocates for string/Mi values. Then encode_merge_fields_into allocates the encoded output. In the hot path (dump_processor calling merge_put), the merge_fn closure calls merge_encoded_docs (not _into), which also allocates the return Vec<u8>.

This is not a bug, but the naming is misleading. Consider either:

  • Actually making _into zero-alloc (reuse a thread-local field vec), or
  • Dropping the "zero allocation" claim from the doc

Also: merge_encoded_docs_into is defined but never called — all callsites use merge_encoded_docs. Dead code?


4. Hot compact merge ordering — verified correct

In compact_hot_from, ops are accumulated left-to-right (merge_fn(accumulated, new_op)), then the accumulated result is merged with the data file (merge_fn(data_file, accumulated)). For Mi concatenation this yields [data_file_values, op1_values, op2_values, ...] — chronologically correct. The test confirms: "base+add1+add2". Good.


5. flush() is a no-op (misleading API)

pub fn flush(&self) -> io::Result<()> {
    Ok(())
}

This silently succeeds without flushing anything. Callers who call flush() expecting data to be durable will be misled. The mmap is stored as _write_mmap — you could expose a method that calls self._write_mmap.flush() instead.

Suggestion: Either make it actually flush, or remove the method and document that callers must drop the writer to trigger the mmap flush.


6. Overflow handling: silent data loss in dump_processor

When merge_put returns false (overflow — merged data exceeds allocated buffer), the dump_processor silently continues:

mw.merge_put(key, &doc_encode_buf, |existing, new| {
    crate::silos::doc_format::merge_encoded_docs(existing, new)
        .unwrap_or_else(|_| new.to_vec())
});

The return value is discarded at every callsite. The overflow count is logged at the end of the phase, but by then the data is lost — those slots will have incomplete Mi arrays.

Suggestion: At minimum, collect overflow keys into a Vec and fall back to ops-log writes for those entries. Or, if the overflow rate is expected to be zero in practice (because buffer_ratio provides enough headroom), add an assertion or at least a per-row warning log so the first overflow is immediately visible.
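The collect-and-retry suggestion could be sketched like this; `merge_put` is a stand-in closure returning false on overflow, and all names are illustrative rather than the actual API:

```rust
use std::sync::Mutex;

/// Record keys whose merge_put overflowed so they can be re-routed through
/// the ops log after the parallel pass, instead of being silently dropped.
fn put_or_queue(
    key: u64,
    bytes: &[u8],
    merge_put: impl Fn(u64, &[u8]) -> bool,
    overflow_keys: &Mutex<Vec<u64>>,
) {
    if !merge_put(key, bytes) {
        // Merged record did not fit in the allocated buffer; queue for retry.
        overflow_keys.lock().unwrap().push(key);
    }
}
```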


7. unwrap_or_else in merge_fn swallows decode errors

This pattern appears 5+ times:

doc_format::merge_encoded_docs(existing, new)
    .unwrap_or_else(|_| new.to_vec())

If decode_doc_fields fails on the existing record (corrupt data, truncated write), the merge silently replaces the entire existing record with just the new data. This loses all previously accumulated fields for that slot.

For dump phases where data integrity matters, consider at least logging when this fallback triggers. A decode failure on an existing record in data.bin suggests something went wrong in a prior phase.


8. collect_doc_op allocates on the merge_writer path

The merge_writer branch in collect_doc_op uses encode_merge_fields (allocating) instead of encode_merge_fields_into (reusing doc_buf):

if let Some(mw) = merge_writer {
    let bytes = crate::silos::doc_format::encode_merge_fields(slot, &fields);
    mw.merge_put(key, &bytes, ...);
}

The parallel-ops-writer path reuses doc_buf via encode_merge_fields_into. For consistency and throughput, the merge writer path should do the same — especially since this runs in a rayon hot loop at 100M+ rows.


9. Minor: compact_cold_merge uses write_batch_parallel

let batch: Vec<(u64, Vec<u8>)> = entries.into_iter().collect();
let count = self.write_batch_parallel(&batch)?;

write_batch_parallel rebuilds the entire data file and index from scratch. For cold compaction that is fine (there is no existing data file, or it is being replaced). But the doc comment doesn't make it clear that cold-merge is a full rewrite. Worth a one-line comment.

Also: the entries HashMap iteration order is nondeterministic. If write_batch_parallel cares about key ordering (e.g., for sequential I/O during reads), this could cause suboptimal layout. Probably fine for dumps, but worth noting.


10. Things done well

  • Striped lock design is clean and well-documented. 1024 stripes with key % N is a solid choice for rayon workloads.
  • Test coverage is good: basic, overflow, 1000-entry concurrent, cold merge, hot merge, and LWW backwards compat. The hot merge test validates the non-obvious three-way merge ordering.
  • Fallback strategy (merge writer when data.bin exists, ops log otherwise) handles the images-phase bootstrap correctly.
  • Code organization: keeping merge_encoded_docs in doc_format.rs and the merge plumbing in datasilo is the right separation of concerns.
  • The encode_dump_merge hoisting in dump_processor (encoding once, then choosing write path) simplifies the control flow vs the old duplicated encode calls.
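For reference, the striped-lock shape praised above is roughly the following (a minimal sketch — the real DumpMergeWriter guards offset/length bookkeeping, not a unit value, and the names here are illustrative):

```rust
use std::sync::Mutex;

/// N mutex buckets selected by key % N, so rayon workers operating on
/// different keys rarely contend while same-key access fully serializes.
struct StripedLocks {
    stripes: Vec<Mutex<()>>,
}

impl StripedLocks {
    fn new(n: usize) -> Self {
        Self { stripes: (0..n).map(|_| Mutex::new(())).collect() }
    }

    /// Run `f` while holding the stripe for `key`.
    fn with_key<T>(&self, key: u64, f: impl FnOnce() -> T) -> T {
        let _guard = self.stripes[(key % self.stripes.len() as u64) as usize]
            .lock()
            .unwrap();
        f()
    }
}
```

With 1024 stripes and well-distributed keys, the collision probability for two concurrent workers is about 0.1%, which is why this works well under rayon.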

Summary

Must fix before merge:

  • Issue #6 (overflow return value silently discarded — data loss)
  • Issue #5 (flush() is a no-op — either implement or remove)

Should fix:

  • Issue #2 (validate op tag before slot extraction)
  • Issue #7 (log when decode fallback triggers)
  • Issue #8 (use encode_merge_fields_into for merge writer path)

Consider:

  • Issue #1 (single mmap to avoid aliasing question)
  • Issue #3 (remove dead merge_encoded_docs_into or use it)

JustMaier and others added 3 commits April 5, 2026 22:36
- Eliminate dual-mmap aliasing: single MmapMut for both reads and writes
- Single 16-byte copy in update_existing_concurrent (prevent torn reads)
- Log decode errors in merge callbacks instead of silent fallback
- Add decode_error_count to DumpMergeWriter stats
- Make flush() actually call mmap.flush()
- Remove unused merge_encoded_docs_into

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ollect_doc_op

- merge_encoded_docs now validates OP_TAG_MERGE/CREATE before slot extraction
  (returns error on corrupt data instead of silently using wrong slot)
- collect_doc_op merge_writer path reuses scratch buffer via encode_merge_fields_into
  (avoids per-call allocation, matches the parallel-ops-writer path)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The merge_fn caused ALL cold compacts to take the slow merge path
(copies 5.7GB of ops data into HashMap). For the images phase dump
(14M+ unique keys, no duplicates), the merge function is never called —
we were just paying the copy cost for nothing.

Now: zero-copy scan always runs first, detects duplicate keys, and only
falls back to the merge-aware path when merging is actually needed.
Images phase cold compact should be fast again.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
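The duplicate-key detection that gates the fallback can be sketched as a single short-circuiting scan (illustrative only — the real zero-copy scan walks ops-log entries rather than an iterator of keys):

```rust
use std::collections::HashSet;

/// True as soon as any key repeats; the merge-aware path is only needed
/// when this returns true, so unique-key dumps stay on the fast path.
fn has_duplicate_keys(keys: impl IntoIterator<Item = u64>) -> bool {
    let mut seen = HashSet::new();
    keys.into_iter().any(|k| !seen.insert(k))
}
```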