
Conversation

sayrer (Contributor) commented Jan 24, 2026

This one was written for TestRulerCl2, but the improvement didn't show up there in wall-clock time because the savings weren't evident relative to the allocation of the test data. The savings were there in the pprof file, though, and it turns out you can see them quite clearly in BenchmarkNumberMatching.

This adds to coreMatcher the pool that got a meh in #470, because the lack of it was distorting the benchmarks. Once these cached buffers exist, making a new nfaBuffers struct becomes more expensive, so not reusing one in coreMatcher was distorting the cost (and the high-level API already does this optimization).

This patch was a regression (14 allocs/op) before I changed coreMatcher.

This one also adds a new benchmark, BenchmarkShellstyleMultiMatch, that I used while profiling. I figured it was worth keeping.

Before:

 % go test -bench='^BenchmarkNumberMatching$' -run=^$ -benchmem
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
BenchmarkNumberMatching-20    	 1000000	      1023 ns/op	    2418 B/op	      12 allocs/op

After:

% go test -bench='^BenchmarkNumberMatching$' -run=^$ -benchmem
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
BenchmarkNumberMatching-20    	 1835420	       651.9 ns/op	    1563 B/op	       4 allocs/op

This change adds a transmap field to nfaBuffers that is reused across
NFA traversals instead of allocating a new one each time. The transmap
is reset before each use to clear the previous state while preserving
the underlying map capacity.
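
For illustration, here is a minimal sketch of that reuse-and-reset shape. The type and field names are made up for the example, not the actual ones in nfa.go:

```go
// Illustrative only: keep the map alive across traversals and clear it
// in place, so Go retains the map's allocated buckets instead of
// building a fresh map every time.
type traversalBufs struct {
	seen map[uint64]bool // reused across traversals
}

func (b *traversalBufs) reset() {
	if b.seen == nil {
		b.seen = make(map[uint64]bool)
		return
	}
	clear(b.seen) // Go 1.21+: empties the map but keeps its backing storage
}
```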
This benchmark exercises NFA traversal code paths with:
- 16 English letter patterns (A*, B*, C*, etc.)
- 4 complex wildcard patterns (*E*E*E*, *A*B*, etc.)
- 5 CJK patterns (Japanese, Chinese, Korean with wildcards)
- 4 Emoji patterns (🎉, 🚀, ❤️, 🌟 with wildcards)
- 32 test events across all character types

This benchmark is useful for measuring allocation patterns and memory
usage in shellstyle/NFA matching workloads. Memory profiling shows
this PR reduces total allocations by ~7% and transmap.all allocations
by ~19% compared to main.
Makes the code more consistent by using map iteration for funky,
CJK, and emoji patterns instead of individual AddPattern calls.
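
As a rough illustration of the shape of such a benchmark using Quamina's public API (the pattern strings, field names, and event below are placeholders, not the ones actually in BenchmarkShellstyleMultiMatch):

```go
// sketch_test.go, illustrative only; the real benchmark lives inside the
// quamina package and uses different patterns and events.
package sketch

import (
	"testing"

	"quamina.net/go/quamina"
)

func BenchmarkShellstyleSketch(b *testing.B) {
	patterns := map[string]string{
		"letters": `{"name": [{"shellstyle": "A*"}]}`,
		"funky":   `{"name": [{"shellstyle": "*A*B*"}]}`,
		"cjk":     `{"name": [{"shellstyle": "日本*"}]}`,
		"emoji":   `{"name": [{"shellstyle": "*🎉*"}]}`,
	}
	q, err := quamina.New()
	if err != nil {
		b.Fatal(err)
	}
	// Add patterns by iterating the map rather than one call per literal.
	for label, pat := range patterns {
		if err := q.AddPattern(label, pat); err != nil {
			b.Fatal(err)
		}
	}
	event := []byte(`{"name": "ABBA 🎉"}`)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := q.MatchesForEvent(event); err != nil {
			b.Fatal(err)
		}
	}
}
```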
sayrer changed the title from "Pr/pool transmap buffer" to "Pool transmap buffer" on Jan 24, 2026

sayrer (Author) commented Jan 24, 2026

BenchmarkNumberMatching didn't use transMap at all; it was just allocating nfaBuffers too much. This new benchmark shows the allocation reductions.

timbray (Owner) commented Jan 26, 2026

Sorry, maxed out on $OTHER_PROJECT for a day or two; will get back to this, and thanks for the work. BTW, the first thing I'll do when I take a serious look is run that Benchmark8259Example on a before/after; my intuition is that it applies a lot of useful stress to the right parts of the code.

sayrer (Author) commented Jan 26, 2026

No worries, I just do these with coffee in the morning and sort out the good ones (hopefully). There's no time pressure.

timbray (Owner) commented Jan 26, 2026

Glad to hear it. BTW, think you could set yourself up for signed commits? Any old SSH key you have lying around will do; it's not rocket science, and more and more repos are asking for it. Since I know you personally I'm willing to do the extra acknowledgment that we're violating repo policy here, but still.

sayrer (Author) commented Jan 26, 2026

I fixed that. See #479 for example. This one just has some old commits.

timbray (Owner) commented Jan 27, 2026

Haven't looked at the code yet, but…

Before:

go test -bench=^Benchmark8259Example$ -run=^$   
117290/sec
Benchmark8259Example-12    	  149448	      8526 ns/op	     632 B/op	      22 allocs/op
120027/sec
Benchmark8259Example-12    	  148406	      8331 ns/op	     634 B/op	      22 allocs/op

After:

111116/sec
Benchmark8259Example-12    	  142170	      9000 ns/op	     647 B/op	      22 allocs/op
109111/sec
Benchmark8259Example-12    	  136897	      9165 ns/op	     658 B/op	      22 allocs/op

sayrer (Author) commented Jan 27, 2026

Well, that's not good. I'll see if that's explainable tomorrow. The allocations shouldn't be the same--that's the puzzling part. So, even if this one is wrong, we should know the reason.

timbray (Owner) commented Jan 27, 2026

Possibly I messed up the switching back and forth between branches? But I think I know what's going on here: thinking about sync.Pool reminded me why it doesn't work that well with Quamina. Go doesn't have thread-local variables, but Quamina does; check out the README section about Concurrency. The concurrency primitives behind sync.Whatever() are not free, so Quamina can afford to do more allocations because it doesn't have to worry about concurrency. If I'd known about sync.Pool when we were first implementing that part of Quamina, we probably wouldn't have bothered with the non-idiomatic quamina.Copy(), but maybe we came out ahead.

Or maybe I'm missing something.
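
For context, a minimal sketch of the per-goroutine pattern the README's Concurrency section describes (background illustration, not code from this PR; the pattern and event JSON are made up):

```go
package main

import (
	"log"
	"sync"

	"quamina.net/go/quamina"
)

func main() {
	q, err := quamina.New()
	if err != nil {
		log.Fatal(err)
	}
	if err := q.AddPattern("cheese-likers", `{"likes": ["cheese"]}`); err != nil {
		log.Fatal(err)
	}
	event := []byte(`{"likes": "cheese"}`)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		// Each goroutine matches against its own copy, so the hot path
		// needs no locks, channels, or sync.Pool coordination.
		go func(qc *quamina.Quamina) {
			defer wg.Done()
			if _, err := qc.MatchesForEvent(event); err != nil {
				log.Println(err)
			}
		}(q.Copy())
	}
	wg.Wait()
}
```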

BTW does Claude do escape analysis?

sayrer (Author) commented Jan 27, 2026

> BTW does Claude do escape analysis?

In a limited sense, yes. It will sometimes come up with fast optimizations, but then realize they aren't feasible because the caller owns the return value. It will also say "if you want to make this any faster, you must allocate once in the caller and pass in a slice rather than allocating" (which isn't idiomatic for this code; that would be more like a video game, and it does say it's a bad idea unless it's really important).
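
A tiny, generic illustration of that "allocate once in the caller and pass in a slice" idea (the function and names here are hypothetical, not from this PR):

```go
// findOffsets is a made-up example: the caller owns the slice and passes
// it in, so repeated calls can reuse one backing array instead of
// allocating a result slice per call.
func findOffsets(dst []int, data []byte, target byte) []int {
	for i, c := range data {
		if c == target {
			dst = append(dst, i)
		}
	}
	return dst
}

// Typical reuse by a caller:
//
//	buf := make([]int, 0, 64)
//	for _, chunk := range chunks {
//		buf = findOffsets(buf[:0], chunk, 'x')
//		// ... use buf before the next iteration overwrites it
//	}
```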

sayrer (Author) commented Jan 27, 2026

Key differences (negative = branch is faster):

| Function | Flat Δ | Cum Δ | Notes |
|---|---|---|---|
| traverseNFA | -0.16s | -0.04s | NFA traversal improved |
| epsilonClosure.getClosure | -0.08s | -0.33s | Significant improvement |
| mapaccess2_fast64 | -0.11s | -0.18s | Fewer map lookups |
| mapIterStart | -0.03s | -0.16s | Less map iteration overhead |
| greyobject | -0.06s | -0.17s | Less GC pressure |
| runtime.madvise | +0.48s | +0.48s | More memory management |
| runtime.memclrNoHeapPointers | +0.18s | +0.18s | Pool buffer clearing |

The pool optimizations are reducing work in epsilonClosure.getClosure (-0.33s cumulative) and map operations, but adding some memory management overhead. The net effect is roughly neutral to slightly positive on this benchmark.

transitionsBuf and resultBuf are always used.

timbray (Owner) commented Jan 27, 2026

What benchmark is producing these delta numbers? I looked at the code and there's nothing I hate, but I'm still not seeing any noticeable effect on my new fave 8259Example benchmark.

sayrer (Author) commented Jan 27, 2026

Using the one you want, but this one does seem to vary depending on the -benchtime setting. The lazy fields are valid, I think. This test happens to exercise all of them, but not all patterns will. The biggest win is the last commit. The numbers above were with the sync.Pool, but that should be gone now.

Before:

%  go test -bench=Benchmark8259Example -benchmem -benchtime=5s -run=^$    
FA: Field matchers: 2 (avg size 2.500, max 4)
Value matchers: 5
SmallTables 20371 (splices 6, avg 4.033, max 66, epsilons avg 0.001, max 2) singletons 1
97528/sec
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
Benchmark8259Example-20    	  620136	     10253 ns/op	     622 B/op	      22 allocs/op

After:

%  go test -bench=Benchmark8259Example -benchmem -benchtime=5s -run=^$
FA: Field matchers: 2 (avg size 2.500, max 4)
Value matchers: 5
SmallTables 20371 (splices 6, avg 4.033, max 66, epsilons avg 0.001, max 2) singletons 1
100207/sec
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
Benchmark8259Example-20    	  653206	      9979 ns/op	     513 B/op	      17 allocs/op

sayrer (Author) commented Jan 27, 2026

Top allocators:

| Function | Alloc | % | Notes |
|---|---|---|---|
| epsilonClosure.getClosure | 157.24MB | 33% | Growing the closures map |
| transmap.all | 101.50MB | 22% | Down from 114MB before |
| addRuneTreeEntry | 79.57MB | 17% | Pattern setup (one-time) |
| traverseNFA | 66.50MB | 14% | Various allocations |
| storeArrayElementField | 24.50MB | 5% | JSON flattening |
| numbits.toQNumber | 19.50MB | 4% | Number conversion |

The transmap.all dropped from 114MB to 101.50MB (~11% reduction). The main remaining allocators are:

  1. epsilonClosure.getClosure (33%) - the closure cache growing
  2. transmap.all (22%) - still allocating a slice per call
  3. traverseNFA (14%) - various internal allocations

timbray (Owner) commented Jan 27, 2026

OK, we're now seeing the same thing.

Switched to branch 'main'
Your branch is up to date with 'origin/main'.
* main
114350/sec
Benchmark8259Example-12    	  147391	      8745 ns/op	     636 B/op	      22 allocs/op
116699/sec
Benchmark8259Example-12    	  141386	      8569 ns/op	     648 B/op	      22 allocs/op
117325/sec
Benchmark8259Example-12    	  146750	      8523 ns/op	     638 B/op	      22 allocs/op
118020/sec
Benchmark8259Example-12    	  148444	      8473 ns/op	     634 B/op	      22 allocs/op
117820/sec
Benchmark8259Example-12    	  150262	      8488 ns/op	     631 B/op	      22 allocs/op
Switched to branch 'pr/pool-transmap-buffer'
* pr/pool-transmap-buffer
118947/sec
Benchmark8259Example-12    	  151880	      8407 ns/op	     532 B/op	      17 allocs/op
116849/sec
Benchmark8259Example-12    	  149766	      8558 ns/op	     536 B/op	      17 allocs/op
118552/sec
Benchmark8259Example-12    	  146882	      8435 ns/op	     541 B/op	      17 allocs/op
117942/sec
Benchmark8259Example-12    	  152149	      8479 ns/op	     531 B/op	      17 allocs/op
120095/sec
Benchmark8259Example-12    	  153506	      8327 ns/op	     529 B/op	      17 allocs/op

Good stuff!

Will now have a closer look at the code.

timbray (Owner) left a comment

Couple of minor comments; that aside, I'm happy with this.

Except for: At this point, nfaBufs is becoming an important part of the puzzle. I wonder if it deserves to be pulled out of nfa.go and given its own file with a bit of commentary at the top explaining it. I spend a lot of time looking at nfa.go and it'd be nice to have simple supporting stuff elsewhere to not get in the way. Possible variation: Create nfa_support.go and shuffle nfaBufs and transMap off into that.

sayrer (Author) commented Jan 27, 2026

> Couple of minor comments; that aside, I'm happy with this.
>
> Except for: At this point, nfaBufs is becoming an important part of the puzzle. I wonder if it deserves to be pulled out of nfa.go and given its own file with a bit of commentary at the top explaining it. I spend a lot of time looking at nfa.go and it'd be nice to have simple supporting stuff elsewhere to not get in the way. Possible variation: Create nfa_support.go and shuffle nfaBufs and transMap off into that.

I can see that, but let me suggest getting the current PRs merged first, before we adjust nfa.go for livability. There are some merge conflicts between this one and #482. After the three patches here are in, that would be the better time to refactor a little.

timbray merged commit 89b3739 into timbray:main on Jan 27, 2026
7 checks passed
sayrer deleted the pr/pool-transmap-buffer branch on January 27, 2026