Conversation


@sayrer sayrer commented Jan 20, 2026

I think this patch is valid, and perf.sh spends too much of its time measuring the loading of test data. This one will not be an easy rubber-stamp like the other Claude patches. The perf.sh numbers don't look like an improvement, but the profiles are very different.

Before:

EXACT: 3.088
EXACT: 3.088
EXACT: 3.088
ANYTHING-BUT: 0.533
ANYTHING-BUT: 0.530
ANYTHING-BUT: 0.534
SHELLSTYLE: 1.393
SHELLSTYLE: 1.402
SHELLSTYLE: 1.402
EQUALS_IGNORE-CASE: 1.746
EQUALS_IGNORE-CASE: 1.746
EQUALS_IGNORE-CASE: 1.746
REGEXP: 1.705
REGEXP: 1.705
REGEXP: 1.590
Avg: 2.821
Avg: 2.824
Avg: 2.787

After:

EXACT: 3.133
EXACT: 3.133
EXACT: 3.133
ANYTHING-BUT: 0.531
ANYTHING-BUT: 0.535
ANYTHING-BUT: 0.534
SHELLSTYLE: 1.323
SHELLSTYLE: 1.440
SHELLSTYLE: 1.459
EQUALS_IGNORE-CASE: 1.746
EQUALS_IGNORE-CASE: 1.761
EQUALS_IGNORE-CASE: 1.746
REGEXP: 1.590
REGEXP: 1.732
REGEXP: 1.761
Avg: 2.822
Avg: 2.820
Avg: 2.878

Comparing the original profile to the final one after optimizations, here is the shift in where CPU time is spent:

Original Profile (High Churn)

The profile was dominated by the Go runtime managing memory (allocating, clearing, sweeping), rather than doing useful work.

      flat  flat%   sum%        cum   cum%
     3.25s 14.95% 14.95%      3.25s 14.95%  runtime.madvise                <-- MEMORY
     2.80s 12.88% 27.83%      2.80s 12.88%  runtime.memclrNoHeapPointers   <-- MEMORY
     1.37s  6.30% 34.13%      1.77s  8.14%  runtime.mallocgcTiny           <-- MEMORY
     1.18s  5.43% 39.56%      1.18s  5.43%  runtime.(*gcBits).bytep
     ...
     0.34s  1.56% 68.22%      7.24s 33.30%  quamina...(*fieldMatcher).transitionOn  <-- Actual Work

Interpretation: Nearly 35% of the "flat" CPU time was spent purely on memory mechanics (madvise, memclr, malloc). The actual matching logic (transitionOn) was buried.


Optimized Profile (Efficient Work)

Now, the profile looks much healthier. The top consumers are dominated by test data loading (which is unavoidable in this benchmark) and actual application logic.

      flat  flat%   sum%        cum   cum%
     0.84s 53.85% 53.85%      0.84s 53.85%  runtime.madvise
     0.24s 15.38% 69.23%      0.24s 15.38%  syscall.syscall
     0.09s  5.77% 75.00%      0.09s  5.77%  runtime.memclrNoHeapPointers
     ...
     0.04s  2.56% 82.69%      0.14s  8.97%  quamina...(*flattenJSON).readObject     <-- Actual Work
     ...
     0.01s  0.64% 94.23%      0.26s 16.67%  quamina...(*Quamina).MatchesForEvent    <-- Actual Work

Interpretation:

  1. Reduced Overhead: While madvise is still present (likely due to the massive initial data load of citylots.json into memory), the raw time spent there dropped from 3.25s to 0.84s.
  2. Logic Visibility: The Quamina functions like readObject and MatchesForEvent are now much more prominent relative to the total runtime (excluding the test setup).
  3. Faster Execution: The total execution time for the benchmark dropped significantly (from ~17s to ~2s in the go test output), meaning the CPU is doing far less "administrative" work for every event matched.

The allocation-heavy functions like mallocgcTiny have effectively disappeared from the top of the profile, confirming that the pooling and slab allocation strategies successfully eliminated the allocation churn.
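For reference, here is a minimal sketch of the pooling pattern being described; the names and fields are illustrative, not Quamina's actual nfaBuffers:

package main

import "sync"

// nfaBuffers is a stand-in for Quamina's scratch-buffer struct.
type nfaBuffers struct {
	buf1, buf2 []int // reusable scratch slices
}

// A sync.Pool hands buffers back and forth so the matcher stops paying
// mallocgc/memclr costs on every event.
var bufPool = sync.Pool{
	New: func() any { return &nfaBuffers{} },
}

func matchEvent(event []byte) {
	bufs := bufPool.Get().(*nfaBuffers)
	defer bufPool.Put(bufs)
	bufs.buf1 = bufs.buf1[:0] // truncate, keeping capacity
	bufs.buf2 = bufs.buf2[:0]
	_ = event // the real matching work would use bufs here
}

func main() { matchEvent([]byte(`{"a":1}`)) }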


timbray commented Jan 20, 2026

Yeah, this one is a tougher sell. A few notes:

  1. I also tried wiring in sync.Pool and eventually discarded it due to it having no measurable effect.
  2. If you look at the actual benchmark called by perf.sh, TestRulerC12, it starts by pulling the test file into memory so that ought to be easy to segregate from the measurements. The Go profiler lets you restrict the calls to descendants of just one method IIRC, that would probably be more useful output.
  3. This PR has two or three relatively independent changes. Two of them are simple, adding reset() methods, but that epsilonClosure stuff has a bit of a code smell to it. Having said that, I am sure that epsilonClosure is one of the best places to look around to make NFA traversal faster (the standard shape of that computation is sketched after this list).
  4. But I'm pretty sure the best place to optimize is the automaton construction & merging. Epsilons are freaking expensive at runtime and I'm pretty sure the code in nfa.go could be smarter about generating fewer of them.
  5. Having said all that, the drop in raw elapsed time seems like a really big deal. Do you have any insights about how that happened?
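As context for item 3: computing an epsilon closure is normally a worklist walk over the epsilon edges. This sketch uses assumed types (state, epsilons) for illustration; it is not Quamina's epsilonClosure:

package main

// state is an assumed NFA-node type, not Quamina's.
type state struct {
	epsilons []*state // epsilon (empty-input) transitions
}

// epsilonClosure returns every state reachable from start via epsilon
// transitions alone, using an explicit worklist instead of recursion.
func epsilonClosure(start []*state) map[*state]bool {
	closure := make(map[*state]bool, len(start))
	work := append([]*state(nil), start...)
	for len(work) > 0 {
		s := work[len(work)-1] // pop
		work = work[:len(work)-1]
		if closure[s] {
			continue // already visited
		}
		closure[s] = true
		work = append(work, s.epsilons...)
	}
	return closure
}

func main() {
	a := &state{}
	b := &state{epsilons: []*state{a}}
	println(len(epsilonClosure([]*state{b}))) // 2
}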


sayrer commented Jan 20, 2026

Maybe a separate PR with that benchmark program first?


sayrer commented Jan 20, 2026

Having said all that, the drop in raw elapsed time seems like a really big deal. Do you have any insights about how that happened?

I think the Quamina code does allocate too much, but perf.sh doesn't let these optimizations show through. I think that Go command line program with CPU and memory footprint measurements will show it.


sayrer commented Jan 20, 2026

Also, I'm totally willing to separate these PRs. But sometimes this kind of stuff needs to land together so that things start fitting in caches, and then four seemingly ineffective optimizations all of a sudden work.


timbray commented Jan 20, 2026

I think the Quamina code does allocate too much, but perf.sh doesn't let these optimizations show through. I think that Go command line program with CPU and memory footprint measurements will show it.

Well, the underlying benchmark decompresses and loads 219M into memory, so yeah, that's a lot of allocation, which needs to be disentangled from the Quamina MatchesForEvent() code. Anyhow, will have a look at your benchmark. Obviously, it's real-world clock time that matters.


sayrer commented Jan 20, 2026

It's not a bad thing, fwiw. It means the Quamina code is faster than the surrounding code, so it has to be isolated to measure it.


timbray commented Jan 20, 2026

Will have a look at that memory profile later today. Might add a CPU profile too.


timbray commented Jan 20, 2026

Hmm, I was trying to compare the runtimes of your branch and my "issue 66" PR branch by running your perf.go program, and was getting basically identical numbers. Then I noticed it imports "quamina.net/go/quamina". I don't know enough about the Go "import" mechanism, but is it possible it's fetching the same library code on both sides?


sayrer commented Jan 20, 2026

I think that is probably the issue. I'll look at the import paths and then make the thing slower to check that it's importing the correct library.
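For what it's worth, the usual way to pin such a program to a local checkout is a replace directive in its go.mod; the module name, version, and path below are hypothetical:

module perfharness

go 1.22

require quamina.net/go/quamina v0.0.0 // placeholder; the replace below wins

// Build against the local branch checkout instead of the published module.
replace quamina.net/go/quamina => ../quamina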


timbray commented Jan 20, 2026

Huh, can't figure out the incantation to make the memory profiler ignore the getCl2Lines and just profile the MatchesForEvent & descendants.

The cpu profiler reveals … no surprises. This particular benchmark really stresses FlattenJSON because the field values are big and the benchmarks use most of them. I'm going to have to buckle down and invest some time into building an actually useful benchmarker that lets the profiler cast light on what's going on in Quamina's own code.
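Regarding the incantation: pprof's -focus and -ignore flags take regexps and restrict the report to stacks that pass through (or avoid) matching functions. Something like the following, with the profile file names assumed:

go tool pprof -focus 'MatchesForEvent' -top cmd/perf/perf.bin mem.prof
go tool pprof -ignore 'getCl2Lines' -top cmd/perf/perf.bin cpu.prof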


sayrer commented Jan 20, 2026

While I get this patch in order, which may take some time, I have seen this pattern before. It's spending a lot of CPU time in the allocator.


timbray commented Jan 20, 2026

Yeah, it's complicated. The pattern of memory usage varies a lot depending on what kind of patterns you're looking for and what kind of records you're scanning. And is often going to be dominated by the flattener code. The patterns that can run in DFA mode (everything but "wildcard" and "regexp") are really hard to improve at this point, they've been profiled and optimized a lot. The ones that need NFA mode have both automaton-creation and automaton-traversal ugliness, and haven't been really well profiled yet. Once again, the biggest win would be eliminating epsilon transitions, if Claude could do that I'd be super impressed.


sayrer commented Jan 21, 2026

I went at this for a little bit, but the result was that the tests measured data loading more than Quamina. So, looked at a few different angles, and found that every time.


timbray commented Jan 21, 2026

So, looked at a few different angles, and found that every time.

Not sure what you're saying here. Does this latest version provide a significant performance increase?


sayrer commented Jan 21, 2026

The results I consistently found were that these were faster, but that the benchmarks measure too much non-Quamina code. As in, code that Quamina doesn't even call into. Just benchmark setup, data loading, etc. The profiler said, after the latest patch, "this is the fastest you can make it in idiomatic Go", and I agree. We can let this one sit if you don't mind, it's not harmful.

I was working on a similar task in another project, and I suspect that the benchmarks are the thing that needs work first (as you said above).


sayrer commented Jan 21, 2026

Running that perf program and inspecting the generated files, it definitely makes a difference. Something like

go tool pprof -http=:8080 cmd/perf/perf.bin cpu.prof

(you need Graphviz)

Firstly, the benchmark spends 65% of its time loading that big file, leaving only a few hundred ms of code to benchmark. If you look at those pprof graphs with and without the patches here, there is a huge difference in the actual Quamina code. It is tough to discern because of noise from loading that file and from pprof itself, but there is a long call graph under the traverse functions that is just gone.
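pprof can also render that with-and-without comparison directly: save a baseline profile from main, then pass it as -diff_base so the graph shows only the delta (file names assumed):

go tool pprof -http=:8080 -diff_base=cpu_main.prof cpu.prof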


timbray commented Jan 21, 2026

Having slept on this, I filed #471: when we're considering optimizations, we shouldn't have to dig through the entrails of the profiler output; the effects should be straightforwardly visible. So I'm going to leave this PR on ice until I (or someone else) gets a better benchmark in place. I don't want to close the PR, though, because I think it'll be useful in benchmarking proposed benchmarks.


timbray commented Jan 22, 2026

OK, I added a simple (but I think quite useful) benchmark that I'm pretty sure focuses on the interesting part of the problem. Why don't you try running this PR against that?


sayrer commented Jan 22, 2026

I'll try it when I get a chance (probably in a day or so). I think the benchmark in #473 looks good, but it needs the input randomized in a loop to insert at least CJK characters and emoji. (This is because you get into the really gnarly parts of the Unicode lookup tables; it's easy to optimize for ASCII if you don't do this, but you really want to hit the variable-length byte cases of UTF-8.)
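A sketch of that kind of randomization; the helper name and rune set are illustrative:

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// randomizedValue mixes ASCII, Latin-1, CJK, and emoji runes so a
// benchmark exercises UTF-8's 1-, 2-, 3-, and 4-byte encodings rather
// than just the ASCII fast path.
func randomizedValue(r *rand.Rand, n int) string {
	runes := []rune{'a', 'Z', '9', 'é', 'ß', '世', '界', '漢', '字', '😀', '🚀'}
	var sb strings.Builder
	for i := 0; i < n; i++ {
		sb.WriteRune(runes[r.Intn(len(runes))])
	}
	return sb.String()
}

func main() {
	r := rand.New(rand.NewSource(42))
	fmt.Println(randomizedValue(r, 12))
}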

1. **Removed slab allocator** from `epsilonClosure` - it was causing more allocations (this was only on this branch, not on main)
2. **Pooled match results** - Added `resultBuf` to `nfaBuffers` with a `matchSet.matchesInto()` method (the general shape of that idiom is sketched below)
3. **Pooled DFA transitions** - Added a `dfaTransitions` buffer to `nfaBuffers` for `traverseDFA()` in the matching hot path (not for `traverseNFA`, since its results are returned to callers)
4. **Fixed buffer initialization** - Ensured `buf1` and `buf2` are cleared at the start of `traverseNFA`
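The matchesInto change in item 2 follows the standard append-into idiom. A generic sketch under assumed types, not the exact Quamina code:

package main

// X stands in for Quamina's opaque pattern-identifier type.
type X any

type matchSet struct{ set map[X]bool }

// matchesInto appends matches to a caller-owned slice, so repeated
// calls reuse one backing array instead of allocating a fresh []X
// per event.
func (m *matchSet) matchesInto(buf []X) []X {
	buf = buf[:0] // drop previous results, keep capacity
	for x := range m.set {
		buf = append(buf, x)
	}
	return buf
}

func main() {
	m := &matchSet{set: map[X]bool{"rule1": true}}
	var results []X
	results = m.matchesInto(results)
	println(len(results)) // 1
}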


sayrer commented Jan 23, 2026

This benchmark is still so fast that the throughput is a little noisy, but this summary is roughly what I got, and the Bytes and Alloc numbers are consistent on both main and this branch.

    Metric                          Before          After           Improvement
    ─────────────────────────────────────────────────────────────────────────────
    Bytes/op                        993 B/op        666 B/op        -33%
    Allocs/op                       31 allocs/op    22 allocs/op    -29%
    Throughput                      ~97k/sec        ~101k/sec       +4%
% go test -bench=^Benchmark8259Example$ -run=^$                          
FA: Field matchers: 2 (avg size 2.500, max 4)
Value matchers: 5
SmallTables 20371 (splices 6, avg 4.033, max 66, epsilons avg 0.001, max 2) singletons 1
100358/sec
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
Benchmark8259Example-20    	  129570	      9964 ns/op	     676 B/op	      22 allocs/op
PASS
ok  	quamina.net/go/quamina	1.713s


@timbray timbray left a comment


A little dubious about changing a bunch of things at once, but I guess I can live with this. Do please glance at comments.

func (m *coreMatcher) matchesForJSONWithFlattener(event []byte, f Flattener) ([]X, error) {
	fields, _ := f.Flatten(event, m.getSegmentsTreeTracker())
	return m.matchesForFields(fields, newNfaBuffers())
	bufs := m.bufferPool.Get().(*nfaBuffers)
timbray commented:

OK I guess, but this func is only called in benchmark scenarios. Glance at Quamina.MatchesForEvent for the production path. The higher level routine is much smarter about keeping the nfaBufs around so I'd like to migrate our benchmarks to that.

timbray commented:

And, I lied. It's mostly called in testing scenarios.

	matches        *matchSet
	transmap       *transmap
	resultBuf      []X
	dfaTransitions []*fieldMatcher
timbray commented:

A little nervous, on general engineering principles, about changing a bunch of things at once. I wonder which of these actually introduce the measured benefits? Probably not a blocker, I guess. See comments about sync.Pool above.

@sayrer sayrer commented Jan 23, 2026

Ah OK, I can break them up if you want. They were all tested individually, but they're all pretty much the same idea. As in, I did one after another.

timbray commented:

If it's not too much work, I'd sure appreciate that. Will probably end up taking all/most of them, but still.

type coreMatcher struct {
	updateable atomic.Pointer[coreFields]
	lock       sync.Mutex
	bufferPool sync.Pool
timbray commented:

Dubious about this one, tried it before and it produced zero measurable change. Probably harmless I guess.

sayrer commented:

OK, I can do this one last. If it helps, it'll be easier to see at the end.

timbray commented:

So you're going to re-spin the PR? Excellent, thanks.

sayrer commented:

Let's leave this one open for the conversation, I'll hang some smaller ones off this one with a number reference.

timbray commented:

This also has the advantage of teaching us things by observing the cpu/memory effects of the tweaks. BTW I'm using GoLand, which has this lovely thing where you can run any test or benchmark with cpu or memory profiling, and it draws you a flame graph in the IDE that's full of lessons. Recommended.

@sayrer sayrer commented Jan 23, 2026

Oh, these are all done with pprof. Sometimes graphical visualizations, but often just reading the output of go tool pprof -top. Something like

go test -bench=^Benchmark8259Example$ -run=^$ -cpuprofile=cpu_pr.prof -benchtime=5s
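Reading the result at the terminal is then just:

go tool pprof -top cpu_pr.prof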


sayrer commented Jan 24, 2026

I think I'll try some of these in a more constrained environment. That second patch was a huge allocation win but only a little bit of a CPU win, and I don't think that's realistic. Nothing in the queue is a loss, but memory is effectively free for this program on a Mac Studio with 64GB of RAM (or Tim's M2 MacBook, which seems to be a little faster).

This program will be running in container/VM/EC2 instances that are not designed to let you forget about what you have running.


timbray commented Jan 24, 2026

FWIW I am working on a parallel track, namely reviving the neglected nfa2Dfa code. The new benchmark runs just over twice as fast when I squish it into a DFA - this with your latest patch:

227396/sec
Benchmark8259Example-12    	  272204	      4406 ns/op	      64 B/op	      11 allocs/op

However, there are issues. First, nfa2dfa has a weird corner-case bug I'm chasing. Second, you can't just automatically apply this to every NFA because for certain NFAs you get into horrible O(2**N) scaling. So it's going to have to be an option and is going to need API design.
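(For context on that blowup: the textbook witness is the language "the Nth symbol from the end is an a", i.e. (a|b)*a(a|b)^(N-1). Its NFA has about N+1 states, but any DFA must remember the last N symbols and so needs on the order of 2^N states.)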


sayrer commented Jan 27, 2026

Summary of the combined optimizations from #478, #479, #482

    Benchmark                      Metric      Main            Combined        Improvement
    ──────────────────────────────────────────────────────────────────────────────────────
    Benchmark8259Example           ns/op       10,270          6,040           41% faster
                                   B/op        679             252             63% less memory
                                   allocs/op   22              13              41% fewer allocs
    BenchmarkShellstyleMultiMatch  ns/op       99,600          36,900          63% faster
                                   B/op        5,870           2,616           55% less memory
                                   allocs/op   101             95              6% fewer allocs


timbray commented Jan 27, 2026

Closing this as superseded by all that other work, for which thanks.

@timbray timbray closed this Jan 27, 2026
@sayrer sayrer deleted the more_allocation_changes branch January 27, 2026 21:20