Conversation


@sayrer sayrer commented Jan 20, 2026

I think this patch is valid, and perf.sh spends too much of its time measuring the loading of test data. This one will not be an easy rubber-stamp like the other Claude patches. The perf.sh numbers don't look like an improvement, but the profiles are very different.

Before:

EXACT: 3.088
EXACT: 3.088
EXACT: 3.088
ANYTHING-BUT: 0.533
ANYTHING-BUT: 0.530
ANYTHING-BUT: 0.534
SHELLSTYLE: 1.393
SHELLSTYLE: 1.402
SHELLSTYLE: 1.402
EQUALS_IGNORE-CASE: 1.746
EQUALS_IGNORE-CASE: 1.746
EQUALS_IGNORE-CASE: 1.746
REGEXP: 1.705
REGEXP: 1.705
REGEXP: 1.590
Avg: 2.821
Avg: 2.824
Avg: 2.787

After:

EXACT: 3.133
EXACT: 3.133
EXACT: 3.133
ANYTHING-BUT: 0.531
ANYTHING-BUT: 0.535
ANYTHING-BUT: 0.534
SHELLSTYLE: 1.323
SHELLSTYLE: 1.440
SHELLSTYLE: 1.459
EQUALS_IGNORE-CASE: 1.746
EQUALS_IGNORE-CASE: 1.761
EQUALS_IGNORE-CASE: 1.746
REGEXP: 1.590
REGEXP: 1.732
REGEXP: 1.761
Avg: 2.822
Avg: 2.820
Avg: 2.878

Comparing the original profile to the final one after optimizations, here is the shift in where CPU time is spent:

Original Profile (High Churn)

The profile was dominated by the Go runtime managing memory (allocating, clearing, sweeping), rather than doing useful work.

      flat  flat%   sum%        cum   cum%
     3.25s 14.95% 14.95%      3.25s 14.95%  runtime.madvise                <-- MEMORY
     2.80s 12.88% 27.83%      2.80s 12.88%  runtime.memclrNoHeapPointers   <-- MEMORY
     1.37s  6.30% 34.13%      1.77s  8.14%  runtime.mallocgcTiny           <-- MEMORY
     1.18s  5.43% 39.56%      1.18s  5.43%  runtime.(*gcBits).bytep
     ...
     0.34s  1.56% 68.22%      7.24s 33.30%  quamina...(*fieldMatcher).transitionOn  <-- Actual Work

Interpretation: Nearly 35% of the "flat" CPU time was spent purely on memory mechanics (madvise, memclr, malloc). The actual matching logic (transitionOn) was buried.


Optimized Profile (Efficient Work)

Now, the profile looks much healthier. The top consumers are dominated by test data loading (which is unavoidable in this benchmark) and actual application logic.

      flat  flat%   sum%        cum   cum%
     0.84s 53.85% 53.85%      0.84s 53.85%  runtime.madvise
     0.24s 15.38% 69.23%      0.24s 15.38%  syscall.syscall
     0.09s  5.77% 75.00%      0.09s  5.77%  runtime.memclrNoHeapPointers
     ...
     0.04s  2.56% 82.69%      0.14s  8.97%  quamina...(*flattenJSON).readObject     <-- Actual Work
     ...
     0.01s  0.64% 94.23%      0.26s 16.67%  quamina...(*Quamina).MatchesForEvent    <-- Actual Work

Interpretation:

  1. Reduced Overhead: While madvise is still present (likely due to the massive initial data load of citylots.json into memory), the raw time spent there dropped from 3.25s to 0.84s.
  2. Logic Visibility: The Quamina functions like readObject and MatchesForEvent are now much more prominent relative to the total runtime (excluding the test setup).
  3. Faster Execution: The total execution time for the benchmark dropped significantly (from ~17s to ~2s in the go test output), meaning the CPU is doing far less "administrative" work for every event matched.

The allocation-heavy functions like mallocgcTiny have effectively disappeared from the top of the profile, confirming that the pooling and slab allocation strategies successfully eliminated the allocation churn.
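For reference, here is a minimal sketch of the pooling pattern being described; the names and fields are illustrative, not Quamina's actual nfaBuffers:

package main

import "sync"

// nfaBuffers is a stand-in for Quamina's scratch-buffer struct.
type nfaBuffers struct {
	buf1, buf2 []int // reusable scratch slices
}

// A sync.Pool hands buffers back and forth so the matcher stops paying
// mallocgc/memclr costs on every event.
var bufPool = sync.Pool{
	New: func() any { return &nfaBuffers{} },
}

func matchEvent(event []byte) {
	bufs := bufPool.Get().(*nfaBuffers)
	defer bufPool.Put(bufs)
	bufs.buf1 = bufs.buf1[:0] // truncate, keeping capacity
	bufs.buf2 = bufs.buf2[:0]
	_ = event // the real matching work would use bufs here
}

func main() { matchEvent([]byte(`{"a":1}`)) }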


timbray commented Jan 20, 2026

Yeah, this one is a tougher sell. A few notes:

  1. I also tried wiring in sync.Pool and eventually discarded it due to it having no measurable effect.
  2. If you look at the actual benchmark called by perf.sh, TestRulerC12, it starts by pulling the test file into memory so that ought to be easy to segregate from the measurements. The Go profiler lets you restrict the calls to descendants of just one method IIRC, that would probably be more useful output.
  3. This PR has two or three relatively independent changes. Two of them are simple, adding reset() methods, but that epsilonClosure stuff has a bit of a code smell to it. Having said that, I am sure that epsilonClosure is one of the best places to look around to make NFA traversal faster (the standard shape of that computation is sketched after this list).
  4. But I'm pretty sure the best place to optimize is the automaton construction & merging. Epsilons are freaking expensive at runtime and I'm pretty sure the code in nfa.go could be smarter about generating fewer of them.
  5. Having said all that, the drop in raw elapsed time seems like a really big deal. Do you have any insights about how that happened?
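As context for item 3: computing an epsilon closure is normally a worklist walk over the epsilon edges. This sketch uses assumed types (state, epsilons) for illustration; it is not Quamina's epsilonClosure:

package main

// state is an assumed NFA-node type, not Quamina's.
type state struct {
	epsilons []*state // epsilon (empty-input) transitions
}

// epsilonClosure returns every state reachable from start via epsilon
// transitions alone, using an explicit worklist instead of recursion.
func epsilonClosure(start []*state) map[*state]bool {
	closure := make(map[*state]bool, len(start))
	work := append([]*state(nil), start...)
	for len(work) > 0 {
		s := work[len(work)-1] // pop
		work = work[:len(work)-1]
		if closure[s] {
			continue // already visited
		}
		closure[s] = true
		work = append(work, s.epsilons...)
	}
	return closure
}

func main() {
	a := &state{}
	b := &state{epsilons: []*state{a}}
	println(len(epsilonClosure([]*state{b}))) // 2
}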


sayrer commented Jan 20, 2026

Maybe a separate PR with that benchmark program first?


sayrer commented Jan 20, 2026

Having said all that, the drop in raw elapsed time seems like a really big deal. Do you have any insights about how that happened?

I think the Quamina code does allocate too much, but perf.sh doesn't let these optimizations show through. I think that Go command line program with CPU and memory footprint measurements will show it.


sayrer commented Jan 20, 2026

Also, I'm totally willing to separate these PRs. But sometimes this kind of stuff needs to land together so that things start fitting in caches, and then four seemingly ineffective optimizations all of a sudden work.


timbray commented Jan 20, 2026

I think the Quamina code does allocate too much, but perf.sh doesn't let these optimizations show through. I think that Go command line program with CPU and memory footprint measurements will show it.

Well, the underlying benchmark decompresses and loads 219M into memory, so yeah, that's a lot of allocation, which needs to be disentangled from the Quamina MatchesForEvent() code. Anyhow, will have a look at your benchmark. Obviously, it's real-world clock time that matters.


sayrer commented Jan 20, 2026

It's not a bad thing, fwiw. It means the Quamina code is faster than the surrounding code, so it has to be isolated to measure it.


timbray commented Jan 20, 2026

Will have a look at that memory profile later today. Might add a CPU profile too.


timbray commented Jan 20, 2026

Hmm, I was trying to compare the runtimes of your branch and my "issue 66" PR branch by running your perf.go program, and was getting basically identical numbers. Then I noticed it imports "quamina.net/go/quamina". I don't know enough about the Go "import" mechanism, but is it possible it's fetching the same library code on both sides?


sayrer commented Jan 20, 2026

I think that is probably the issue. I'll look at the import paths and then make the thing slower to check that it's importing the correct library.
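For what it's worth, the usual way to pin such a program to a local checkout is a replace directive in its go.mod; the module name, version, and path below are hypothetical:

module perfharness

go 1.22

require quamina.net/go/quamina v0.0.0 // placeholder; the replace below wins

// Build against the local branch checkout instead of the published module.
replace quamina.net/go/quamina => ../quamina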


timbray commented Jan 20, 2026

Huh, can't figure out the incantation to make the memory profiler ignore the getCl2Lines and just profile the MatchesForEvent & descendants.

The cpu profiler reveals … no surprises. This particular benchmark really stresses FlattenJSON because the field values are big and the benchmarks use most of them. I'm going to have to buckle down and invest some time into building an actually useful benchmarker that lets the profiler cast light on what's going on in Quamina's own code.
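Regarding the incantation: pprof's -focus and -ignore flags take regexps and restrict the report to stacks that pass through (or avoid) matching functions. Something like the following, with the profile file names assumed:

go tool pprof -focus 'MatchesForEvent' -top cmd/perf/perf.bin mem.prof
go tool pprof -ignore 'getCl2Lines' -top cmd/perf/perf.bin cpu.prof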


sayrer commented Jan 20, 2026

While I get this patch in order, which may take some time, I have seen this pattern before. It's spending a lot of CPU time in the allocator.


timbray commented Jan 20, 2026

Yeah, it's complicated. The pattern of memory usage varies a lot depending on what kind of patterns you're looking for and what kind of records you're scanning. And is often going to be dominated by the flattener code. The patterns that can run in DFA mode (everything but "wildcard" and "regexp") are really hard to improve at this point, they've been profiled and optimized a lot. The ones that need NFA mode have both automaton-creation and automaton-traversal ugliness, and haven't been really well profiled yet. Once again, the biggest win would be eliminating epsilon transitions, if Claude could do that I'd be super impressed.


sayrer commented Jan 21, 2026

I went at this for a little bit, but the result was that the tests measured data loading more than Quamina. So, looked at a few different angles, and found that every time.


timbray commented Jan 21, 2026

So, looked at a few different angles, and found that every time.

Not sure what you're saying here. Does this latest version provide a significant performance increase?


sayrer commented Jan 21, 2026

The results I consistently found were that these were faster, but that the benchmarks measure too much non-Quamina code. As in, code that Quamina doesn't even call into. Just benchmark setup, data loading, etc. The profiler said, after the latest patch, "this is the fastest you can make it in idiomatic Go", and I agree. We can let this one sit if you don't mind, it's not harmful.

I was working on a similar task in another project, and I suspect that the benchmarks are the thing that needs work first (as you said above).


sayrer commented Jan 21, 2026

Running that perf program and inspecting the generated files, it definitely makes a difference. Something like

go tool pprof -http=:8080 cmd/perf/perf.bin cpu.prof

(you need Graphviz)

Firstly, the benchmark spends 65% of its time loading that big file, leaving only a few hundred ms of code to benchmark. If you look at those pprof graphs with and without the patches here, there is a huge difference in the actual Quamina code. It is tough to discern because of noise from loading that file and from pprof itself, but there is a long call graph under the traverse functions that is just gone.
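pprof can also render that with-and-without comparison directly: save a baseline profile from main, then pass it as -diff_base so the graph shows only the delta (file names assumed):

go tool pprof -http=:8080 -diff_base=cpu_main.prof cpu.prof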


timbray commented Jan 21, 2026

Having slept on this, I filed #471: when we're considering optimizations, we shouldn't have to dig through the entrails of the profiler output; the effects should be straightforwardly visible. So I'm going to leave this PR on ice until I (or someone else) gets a better benchmark in place. I don't want to close the PR, though, because I think it'll be useful in benchmarking proposed benchmarks.


timbray commented Jan 22, 2026

OK, I added a simple (but I think quite useful) benchmark that I'm pretty sure focuses on the interesting part of the problem. Why don't you try running this PR against that?


sayrer commented Jan 22, 2026

I'll try it when I get a chance (probably in a day or so). I think the benchmark in #473 looks good, but it needs the input randomized in a loop to insert at least CJK characters and emoji. (This is because you get into the really gnarly parts of the Unicode lookup tables; it's easy to optimize for ASCII if you don't do this, but you really want to hit the variable-length byte cases of UTF-8.)
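A sketch of that kind of randomization; the helper name and rune set are illustrative:

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// randomizedValue mixes ASCII, Latin-1, CJK, and emoji runes so a
// benchmark exercises UTF-8's 1-, 2-, 3-, and 4-byte encodings rather
// than just the ASCII fast path.
func randomizedValue(r *rand.Rand, n int) string {
	runes := []rune{'a', 'Z', '9', 'é', 'ß', '世', '界', '漢', '字', '😀', '🚀'}
	var sb strings.Builder
	for i := 0; i < n; i++ {
		sb.WriteRune(runes[r.Intn(len(runes))])
	}
	return sb.String()
}

func main() {
	r := rand.New(rand.NewSource(42))
	fmt.Println(randomizedValue(r, 12))
}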

1. **Removed slab allocator** from `epsilonClosure` - it was causing more allocations (this was only on this branch, not on main)
2. **Pooled match results** - Added `resultBuf` to `nfaBuffers` with a `matchSet.matchesInto()` method (the general shape of that idiom is sketched below)
3. **Pooled DFA transitions** - Added a `dfaTransitions` buffer to `nfaBuffers` for `traverseDFA()` in the matching hot path (not for `traverseNFA`, since its results are returned to callers)
4. **Fixed buffer initialization** - Ensured `buf1` and `buf2` are cleared at the start of `traverseNFA`
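The matchesInto change in item 2 follows the standard append-into idiom. A generic sketch under assumed types, not the exact Quamina code:

package main

// X stands in for Quamina's opaque pattern-identifier type.
type X any

type matchSet struct{ set map[X]bool }

// matchesInto appends matches to a caller-owned slice, so repeated
// calls reuse one backing array instead of allocating a fresh []X
// per event.
func (m *matchSet) matchesInto(buf []X) []X {
	buf = buf[:0] // drop previous results, keep capacity
	for x := range m.set {
		buf = append(buf, x)
	}
	return buf
}

func main() {
	m := &matchSet{set: map[X]bool{"rule1": true}}
	var results []X
	results = m.matchesInto(results)
	println(len(results)) // 1
}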


sayrer commented Jan 23, 2026

This benchmark is still so fast that the throughput is a little noisy, but this summary is roughly what I got, and the Bytes and Alloc numbers are consistent on both main and this branch.

    Metric                          Before          After           Improvement
    ─────────────────────────────────────────────────────────────────────────────
    Bytes/op                        993 B/op        666 B/op        -33%
    Allocs/op                       31 allocs/op    22 allocs/op    -29%
    Throughput                      ~97k/sec        ~101k/sec       +4%
% go test -bench=^Benchmark8259Example$ -run=^$                          
FA: Field matchers: 2 (avg size 2.500, max 4)
Value matchers: 5
SmallTables 20371 (splices 6, avg 4.033, max 66, epsilons avg 0.001, max 2) singletons 1
100358/sec
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
Benchmark8259Example-20    	  129570	      9964 ns/op	     676 B/op	      22 allocs/op
PASS
ok  	quamina.net/go/quamina	1.713s


@timbray timbray left a comment


A little dubious about changing a bunch of things at once, but I guess I can live with this. Do please glance at comments.

func (m *coreMatcher) matchesForJSONWithFlattener(event []byte, f Flattener) ([]X, error) {
	fields, _ := f.Flatten(event, m.getSegmentsTreeTracker())
	return m.matchesForFields(fields, newNfaBuffers())
	bufs := m.bufferPool.Get().(*nfaBuffers)
timbray commented:

OK I guess, but this func is only called in benchmark scenarios. Glance at Quamina.MatchesForEvent for the production path. The higher level routine is much smarter about keeping the nfaBufs around so I'd like to migrate our benchmarks to that.

timbray commented:

And, I lied. It's mostly called in testing scenarios.

	matches        *matchSet
	transmap       *transmap
	resultBuf      []X
	dfaTransitions []*fieldMatcher
timbray commented:

A little nervous, on general engineering principles, about changing a bunch of things at once. I wonder which of these actually introduce the measured benefits? Probably not a blocker, I guess. See comments about sync.Pool above.

@sayrer sayrer commented Jan 23, 2026

Ah OK, I can break them up if you want. They were all tested individually, but they're all pretty much the same idea. As in, I did one after another.

timbray commented:

If it's not too much work, I'd sure appreciate that. Will probably end up taking all/most of them, but still.

type coreMatcher struct {
	updateable atomic.Pointer[coreFields]
	lock       sync.Mutex
	bufferPool sync.Pool
timbray commented:

Dubious about this one, tried it before and it produced zero measurable change. Probably harmless I guess.

sayrer commented:

OK, I can do this one last. If it helps, it'll be easier to see at the end.

timbray commented:

So you're going to re-spin the PR? Excellent, thanks.

sayrer commented:

Let's leave this one open for the conversation, I'll hang some smaller ones off this one with a number reference.

timbray commented:

This also has the advantage of teaching us things by observing the cpu/memory effects of the tweaks. BTW I'm using GoLand, which has this lovely thing where you can run any test or benchmark with cpu or memory profiling, and it draws you a flame graph in the IDE that's full of lessons. Recommended.

@sayrer sayrer commented Jan 23, 2026

Oh, these are all done with pprof. Sometimes graphical visualizations, but often just reading the output of go tool pprof -top. Something like

go test -bench=^Benchmark8259Example$ -run=^$ -cpuprofile=cpu_pr.prof -benchtime=5s
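Reading the result at the terminal is then just:

go tool pprof -top cpu_pr.prof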


sayrer commented Jan 24, 2026

I think I'll try some of these in a more constrained environment. That second patch was a huge allocation win but only a little bit of a CPU win, and I don't think that's realistic. Nothing in the queue is a loss, but memory is effectively free for this program on a Mac Studio with 64GB of RAM (or Tim's M2 MacBook, which seems to be a little faster).

This program will be running in container/VM/EC2 instances that are not designed to let you forget about what you have running.


timbray commented Jan 24, 2026

FWIW I am working on a parallel track, namely reviving the neglected nfa2Dfa code. The new benchmark runs just over twice as fast when I squish it into a DFA - this with your latest patch:

227396/sec
Benchmark8259Example-12    	  272204	      4406 ns/op	      64 B/op	      11 allocs/op

However, there are issues. First, nfa2dfa has a weird corner-case bug I'm chasing. Second, you can't just automatically apply this to every NFA because for certain NFAs you get into horrible O(2**N) scaling. So it's going to have to be an option and is going to need API design.
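(For context on that blowup: the textbook witness is the language "the Nth symbol from the end is an a", i.e. (a|b)*a(a|b)^(N-1). Its NFA has about N+1 states, but any DFA must remember the last N symbols and so needs on the order of 2^N states.)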


sayrer commented Jan 27, 2026

Summary of the combined optimizations from #478, #479, #482

    Benchmark                      Metric      Main            Combined        Improvement
    ──────────────────────────────────────────────────────────────────────────────────────
    Benchmark8259Example           ns/op       10,270          6,040           41% faster
                                   B/op        679             252             63% less memory
                                   allocs/op   22              13              41% fewer allocs
    BenchmarkShellstyleMultiMatch  ns/op       99,600          36,900          63% faster
                                   B/op        5,870           2,616           55% less memory
                                   allocs/op   101             95              6% fewer allocs


timbray commented Jan 27, 2026

Closing this as superseded by all that other work, for which thanks.

@timbray timbray closed this Jan 27, 2026
@sayrer sayrer deleted the more_allocation_changes branch January 27, 2026 21:20