Context
Discovered during review of #97 (skip untouched files in per-file extras).
Description
In loader.py:_write_per_file_extras, edge stats are counted two different ways:
- READS_ATOM / WRITES_ATOM (lines 835, 844): use session.run("MATCH ()-[r:X]->() RETURN count(r)"), which counts all edges of that type in the entire DB.
- READS_ENV / HANDLES_EVENT / EMITS_EVENT (lines 853, 861, 869): use len(local_list), which counts only edges from the current batch.
In full-index mode (DB wiped first), these produce the same result. In incremental mode (touched_files set), they diverge: DB-wide counts include pre-existing edges from prior runs, while len() only counts edges from touched files.
Neither approach is wrong per se, but the inconsistency makes the stats object unreliable when comparing edge types in incremental mode.
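To make the divergence concrete, here is a minimal toy model of the two conventions. It does not use the real loader or a real Neo4j session: the "DB" is a plain list, and `db_wide_count` and `run_stats` are hypothetical names for this illustration. It also assumes that an incremental run deletes the touched file's old edges before rewriting them, which may not match the actual loader.

```python
# Toy model: the "DB" is a list of (edge_type, source_file) tuples.
# All names here are hypothetical; nothing is taken from loader.py.

def db_wide_count(db, edge_type):
    """Stand-in for MATCH ()-[r:X]->() RETURN count(r): every edge of that type."""
    return sum(1 for etype, _ in db if etype == edge_type)


def run_stats(db, batch):
    """Write the batch, then report stats using the current mixed conventions."""
    db.extend(batch)
    reads_env = [edge for edge in batch if edge[0] == "READS_ENV"]
    return {
        "READS_ATOM": db_wide_count(db, "READS_ATOM"),  # DB-wide convention
        "READS_ENV": len(reads_env),                    # batch-local convention
    }


# Full index: the DB starts empty, so both conventions agree.
db = []
full_batch = [
    ("READS_ATOM", "a.py"), ("READS_ATOM", "b.py"),
    ("READS_ENV", "a.py"), ("READS_ENV", "b.py"),
]
full_stats = run_stats(db, full_batch)

# Incremental run: only a.py is touched. Assumption: its old edges are
# removed first, then rewritten from the new batch.
db[:] = [edge for edge in db if edge[1] != "a.py"]
incr_batch = [("READS_ATOM", "a.py"), ("READS_ENV", "a.py")]
incr_stats = run_stats(db, incr_batch)

print(full_stats)
print(incr_stats)
```

In the incremental run the READS_ATOM figure still includes the edge from b.py, while the READS_ENV figure only reflects a.py, so the two numbers are no longer comparable.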
Suggested approach
Pick one convention and apply it uniformly:
- Option A: Use len() everywhere (fast, counts only what was written this run).
- Option B: Use DB-wide count(r) everywhere (accurate total, but requires extra queries).
Option A is simpler and consistent with what callers likely expect from an incremental run ("how many edges did this run produce?").
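A rough sketch of what Option A could look like, assuming the edges written in a run can be gathered as (edge_type, source, target) tuples before the stats are built. The helper `count_written_edges` and the tuple shape are hypothetical and not taken from loader.py; only the five edge type names come from this issue.

```python
# Sketch of Option A: one counting convention for every edge type.
# The helper name and the tuple layout are assumptions for illustration.

EDGE_TYPES = (
    "READS_ATOM",
    "WRITES_ATOM",
    "READS_ENV",
    "HANDLES_EVENT",
    "EMITS_EVENT",
)


def count_written_edges(edges):
    """Return per-type counts of the edges written in this run only."""
    counts = {edge_type: 0 for edge_type in EDGE_TYPES}
    for edge_type, _source, _target in edges:
        if edge_type in counts:  # ignore edge types outside the stats object
            counts[edge_type] += 1
    return counts


# Example batch from a hypothetical incremental run touching one file.
batch = [
    ("READS_ATOM", "a.py", "userAtom"),
    ("WRITES_ATOM", "a.py", "userAtom"),
    ("READS_ENV", "a.py", "API_URL"),
    ("READS_ENV", "a.py", "DEBUG"),
]
stats = count_written_edges(batch)
print(stats)
```

Every edge type then answers the same question ("how many edges did this run produce?"), and no extra queries against the DB are needed to build the stats object.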