From a15fd78b09ce1ce6b50bdd488cd056e4219a77cf Mon Sep 17 00:00:00 2001
From: Auke Kok
Date: Wed, 18 Feb 2026 11:39:49 -0800
Subject: [PATCH 1/2] Add test and counter for stale seq in merge delta
 combining.

merge_read_item() fails to update found->seq when combining delta items
from multiple finalized log trees. Add a test case to replicate the
conditions of this issue.

The conditions to get this to reproduce are pretty tricky. We have to
exceed the block limit, and that's nearly impossible at VM scale
without significant load. Alternatively, we can force a partial merge
with a trigger, which is what this changeset does.

If we spammed the trigger from a shell script we'd have to deal with
the exec overhead of `echo`, so I've dropped in a python script to
avoid that; without it we'd lose the window to hit this bug. Similarly,
all the readers need to run as background threads to race against the
short window in which they can hit the double counting.

This all makes this test case extremely convoluted.

Instead of random values to add up, I've picked values that let us
identify whether double-counting happened, and that avoid the problem
that mounts may not have added their totals yet while we're reading
through a sliding-window data set. Now we can just look at the bit
pattern and separate the valid combinations (3 bits per delta, only the
rightmost bit of a window may be set) from invalid ones (middle or left
bit in the window set). Each mount gets its own "bit window".

This seems to hit the bug on my VM test machines about 60%-80% of the
time, so not overly consistent. Without the embedded python trigger
smashing it drops dramatically, probably to under 1%; it's just too
slow.

Previously this test sorted the fs_nrs by RID name, but that's likely
made obsolete by running the readers all in parallel as it does now.
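The bit-window scheme can be sketched in a few lines of Python. This is an
illustrative model, not part of the patch; the helper names (decode,
double_counted) are invented here, and the values and shifts come from the
test script below (vals=( 4096 512 64 8 1 ), shifts 12 9 6 3 0):

```python
# Each mount writes one delta value that occupies its own 3-bit window
# in the total. A correct merge applies each delta exactly once, so
# every window holds 0 (not yet counted) or 1; a count of 2 or more
# means a delta was double-applied.
VALS = [4096, 512, 64, 8, 1]   # one value per mount, one bit 3 apart
SHIFTS = [12, 9, 6, 3, 0]      # window positions, mount 0 leftmost

def decode(total):
    """Return per-mount counts extracted from the 3-bit windows."""
    return [(total >> s) & 0x7 for s in SHIFTS]

def double_counted(total):
    """True if any mount's delta was applied more than once."""
    return any(c > 1 for c in decode(total))

ok = sum(VALS)            # 4681, all five deltas applied exactly once
bad = ok + 64             # mount 2's delta (64) applied twice

print(decode(ok))         # [1, 1, 1, 1, 1]
print(decode(bad))        # [1, 1, 2, 1, 1]
print(double_counted(ok), double_counted(bad))   # False True
```

A partial total such as ok - 512 still decodes to all-zero-or-one windows,
which is why the sliding-window reads don't false-positive.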
There was also a version that lowered the merge block limit to 64k, but
that was deemed too crude, though I note that it made it particularly
easy to demonstrate the underlying issue ;).

Signed-off-by: Auke Kok
---
 kmod/src/forest.c              |   3 +
 kmod/src/triggers.c            |   1 +
 kmod/src/triggers.h            |   1 +
 tests/golden/totl-merge-read   |   3 +
 tests/sequence                 |   1 +
 tests/tests/totl-merge-read.sh | 132 +++++++++++++++++++++++++++++++++
 6 files changed, 141 insertions(+)
 create mode 100644 tests/golden/totl-merge-read
 create mode 100644 tests/tests/totl-merge-read.sh

diff --git a/kmod/src/forest.c b/kmod/src/forest.c
index 306f4841..41e99008 100644
--- a/kmod/src/forest.c
+++ b/kmod/src/forest.c
@@ -26,6 +26,7 @@
 #include "hash.h"
 #include "srch.h"
 #include "counters.h"
+#include "triggers.h"
 #include "xattr.h"
 #include "scoutfs_trace.h"
@@ -731,6 +732,8 @@ static void scoutfs_forest_log_merge_worker(struct work_struct *work)
 	ret = scoutfs_btree_merge(sb, &alloc, &wri, &req.start, &req.end, &next,
 				  &comp.root, &inputs,
 				  !!(req.flags & cpu_to_le64(SCOUTFS_LOG_MERGE_REQUEST_SUBTREE)),
+				  scoutfs_trigger(sb, LOG_MERGE_FORCE_PARTIAL) ?
+				  SCOUTFS_BLOCK_LG_SIZE : SCOUTFS_LOG_MERGE_DIRTY_BYTE_LIMIT,
 				  10, (2 * 1024 * 1024));
 	if (ret == -ERANGE) {
diff --git a/kmod/src/triggers.c b/kmod/src/triggers.c
index 317f0911..10ebfd4d 100644
--- a/kmod/src/triggers.c
+++ b/kmod/src/triggers.c
@@ -45,6 +45,7 @@ static char *names[] = {
 	[SCOUTFS_TRIGGER_SRCH_FORCE_LOG_ROTATE] = "srch_force_log_rotate",
 	[SCOUTFS_TRIGGER_SRCH_MERGE_STOP_SAFE] = "srch_merge_stop_safe",
 	[SCOUTFS_TRIGGER_STATFS_LOCK_PURGE] = "statfs_lock_purge",
+	[SCOUTFS_TRIGGER_LOG_MERGE_FORCE_PARTIAL] = "log_merge_force_partial",
 };

 bool scoutfs_trigger_test_and_clear(struct super_block *sb, unsigned int t)
diff --git a/kmod/src/triggers.h b/kmod/src/triggers.h
index eeb33b49..0b60f74e 100644
--- a/kmod/src/triggers.h
+++ b/kmod/src/triggers.h
@@ -8,6 +8,7 @@ enum scoutfs_trigger {
 	SCOUTFS_TRIGGER_SRCH_FORCE_LOG_ROTATE,
 	SCOUTFS_TRIGGER_SRCH_MERGE_STOP_SAFE,
 	SCOUTFS_TRIGGER_STATFS_LOCK_PURGE,
+	SCOUTFS_TRIGGER_LOG_MERGE_FORCE_PARTIAL,
 	SCOUTFS_TRIGGER_NR,
 };
diff --git a/tests/golden/totl-merge-read b/tests/golden/totl-merge-read
new file mode 100644
index 00000000..931671e6
--- /dev/null
+++ b/tests/golden/totl-merge-read
@@ -0,0 +1,3 @@
+== setup
+expected 4681
+== cleanup
diff --git a/tests/sequence b/tests/sequence
index 8091b1b1..c2ec96c8 100644
--- a/tests/sequence
+++ b/tests/sequence
@@ -26,6 +26,7 @@ simple-xattr-unit.sh
 retention-basic.sh
 totl-xattr-tag.sh
 quota.sh
+totl-merge-read.sh
 lock-refleak.sh
 lock-shrink-consistency.sh
 lock-shrink-read-race.sh
diff --git a/tests/tests/totl-merge-read.sh b/tests/tests/totl-merge-read.sh
new file mode 100644
index 00000000..3cd40644
--- /dev/null
+++ b/tests/tests/totl-merge-read.sh
@@ -0,0 +1,132 @@
+#
+# Test that merge_read_item() correctly updates the sequence number when
+# combining delta items from multiple finalized log trees.
+#
+# A bug in merge_read_item() fails to update found->seq to the max of
+# both items' seqs when combining deltas.
+# The combined item retains a stale seq from the lower-RID tree
+# processed first.
+#
+# Multiple write/merge/read passes increase the chance of hitting the
+# double-counting window.
+#
+# The log_merge_force_partial trigger forces one-block dirty limits on
+# each merge iteration, causing many partial merges that splice
+# stale-seq items into fs_root while finalized logs still exist.
+#
+# The read-xattr-totals ioctl uses a sliding cursor that can return
+# pre-merge values (< expected) or duplicate entries when btree data
+# changes between batches. To avoid test false positives we've assigned
+# a 3-bit window to each mount, so that any double counting shows up as
+# overflow in that mount's bit window.
+#

+t_require_commands setfattr scoutfs
+t_require_mounts 5

+NR_KEYS=2500

+echo "== setup"
+for nr in $(t_fs_nrs); do
+	d=$(eval echo \$T_D$nr)
+	for i in $(seq 1 $NR_KEYS); do
+		: > "$d/file-${nr}-${i}"
+	done
+done
+sync
+t_force_log_merge

+sv=$(t_server_nr)

+vals=( 4096 512 64 8 1 )
+expected=$(( vals[0] + vals[1] + vals[2] + vals[3] + vals[4] ))

+n=0
+for nr in $(t_fs_nrs); do
+	d=$(eval echo \$T_D$nr)
+	val=${vals[$n]}
+	(( n++ ))
+	for i in $(seq 1 $NR_KEYS); do
+		setfattr -n "scoutfs.totl.test.${i}.0.0" -v $val \
+			"$d/file-${nr}-${i}"
+	done
+done

+sync

+last_complete=$(t_counter log_merge_complete $sv)

+t_trigger_arm_silent log_merge_force_finalize_ours $sv
+t_sync_seq_index
+while test "$(t_trigger_get log_merge_force_finalize_ours $sv)" == "1"; do
+	sleep .1
+done

+# Spam the log_merge_force_partial trigger in a tight Python loop,
+# avoiding the fork/exec overhead of doing this in a shell "echo"
+# loop, which is too slow.
+trigger_paths=""
+for i in $(t_fs_nrs); do
+	trigger_paths="$trigger_paths $(t_trigger_path $i)/log_merge_force_partial"
+done
+python3 -c "
+import sys
+paths = sys.argv[1:]
+while True:
+    for p in paths:
+        with open(p, 'w') as f:
+            f.write('1')
+" $trigger_paths &
+spam_pid=$!

+bad_dir="$T_TMPDIR/bad"
+mkdir -p "$bad_dir"

+read_totals() {
+	local nr=$1
+	local mnt=$(eval echo \$T_M$nr)
+	while true; do
+		echo 1 > $(t_debugfs_path $nr)/drop_weak_item_cache
+		# This is probably too elaborate, but it's pretty neat
+		# that we can illustrate the double reads this way with
+		# some awk magic.
+		scoutfs read-xattr-totals -p "$mnt" | \
+			awk -F'[ =,]+' -v expect=$expected \
+			'or($2+0, expect) != expect {
+				v = $2+0; s = ""
+				split("0 1 2 3 4", m)
+				split("12 9 6 3 0", sh)
+				for (i = 1; i <= 5; i++) {
+					c = and(rshift(v, sh[i]+0), 7)
+					if (c > 1) s = s " m" m[i] ":" c
+				}
+				printf "%s (%s)\n", $0, substr(s, 2)
+			}' >> "$bad_dir/$nr"
+	done
+}

+echo "expected $expected"
+reader_pids=""
+for nr in $(t_fs_nrs); do
+	read_totals $nr &
+	reader_pids="$reader_pids $!"
+done

+while (( $(t_counter log_merge_complete $sv) == last_complete )); do
+	sleep .1
+done

+t_silent_kill $spam_pid $reader_pids

+for nr in $(t_fs_nrs); do
+	if [ -s "$bad_dir/$nr" ]; then
+		echo "double-counted totals on mount $nr:"
+		cat "$bad_dir/$nr"
+	fi
+done

+echo "== cleanup"
+for nr in $(t_fs_nrs); do
+	d=$(eval echo \$T_D$nr)
+	find "$d" -maxdepth 1 -name "file-${nr}-*" -delete
+done

+t_pass

From 5c73c8d5c138e00534cfc45e6acf384669067aca Mon Sep 17 00:00:00 2001
From: Auke Kok
Date: Fri, 27 Feb 2026 10:06:05 -0800
Subject: [PATCH 2/2] Update seq when merging deltas from partial log merge.

Two different clients can write deltas for totl indexes at the same
time, recording their changes. When merged, a reader should apply both
in order, and only once. To do so, the seq determines whether a delta
has already been applied.
The code fails to update the seq while walking the trees for deltas to
apply. Subsequently, when processing later trees, it could re-process
deltas that were already applied. In the case of a large negative delta
(e.g. removal of large numbers of files), the totl value could become
negative, resulting in quota lockout.

The fix is simple: advance the seq when reading partial delta merges to
avoid double counting.

Signed-off-by: Auke Kok
---
 kmod/src/btree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kmod/src/btree.c b/kmod/src/btree.c
index 7a21afc7..1a2d85e9 100644
--- a/kmod/src/btree.c
+++ b/kmod/src/btree.c
@@ -2183,6 +2183,8 @@ static int merge_read_item(struct super_block *sb, struct scoutfs_key *key, u64
 	if (ret > 0) {
 		if (ret == SCOUTFS_DELTA_COMBINED) {
 			scoutfs_inc_counter(sb, btree_merge_delta_combined);
+			if (seq > found->seq)
+				found->seq = seq;
 		} else if (ret == SCOUTFS_DELTA_COMBINED_NULL) {
 			scoutfs_inc_counter(sb, btree_merge_delta_null);
 			free_mitem(rng, found);
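The hazard the two-line fix closes can be modeled outside the kernel. The
sketch below is a toy Python model, not scoutfs code: items are (delta, seq)
pairs, combine() stands in for the delta combining in merge_read_item(), and
read_total() stands in for a later reader that uses seq to skip deltas it has
already folded in. All names are invented for illustration.

```python
def combine(items):
    """Fold delta items from several finalized log trees into one
    combined item, keeping the max seq seen (the fixed behavior)."""
    total, seq = 0, 0
    for delta, s in items:
        total += delta
        seq = max(seq, s)   # the fix: advance seq past every input
    return total, seq

def read_total(combined, live_logs):
    """A reader applies a still-visible log delta only if its seq is
    newer than the combined item's seq."""
    total, seq = combined
    for delta, s in live_logs:
        if s > seq:         # newer than anything already folded in
            total += delta
    return total

# Two trees carry a delta for the same key; tree b has the higher seq.
a, b = (64, 5), (64, 9)
fixed = combine([a, b])      # (128, 9)
buggy = (128, 5)             # bug: seq stayed at the first tree's 5

# During a partial merge, tree b's finalized log is still visible:
print(read_total(fixed, [b]))   # 128  -- b skipped, counted once
print(read_total(buggy, [b]))   # 192  -- b's delta counted twice
```

With a negative delta the same double-apply drives the total below zero,
which is the quota-lockout case the commit message describes.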