
Conversation

sayrer (Contributor) commented Jan 24, 2026

This one was written for TestRulerCl2, but the improvement didn't show up there in wall-clock time because the savings weren't evident relative to the allocation of the test data. The savings were there in the pprof file, though, and it turns out you can see them quite clearly in BenchmarkNumberMatching.

This adds to coreMatcher the pool that got a meh in #470, because the lack of it was distorting the benchmarks. Once these cached buffers exist, making a new nfaBuffers struct becomes more expensive, so not reusing one in coreMatcher was distorting the cost (and the high-level API already does this optimization).

This patch was a regression (14 allocs/op) before I changed coreMatcher.

This one also adds a new benchmark, BenchmarkShellstyleMultiMatch, that I used while profiling. I figured it was worth keeping.

Before:

 % go test -bench='^BenchmarkNumberMatching$' -run=^$ -benchmem
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
BenchmarkNumberMatching-20    	 1000000	      1023 ns/op	    2418 B/op	      12 allocs/op

After:

% go test -bench='^BenchmarkNumberMatching$' -run=^$ -benchmem
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
BenchmarkNumberMatching-20    	 1835420	       651.9 ns/op	    1563 B/op	       4 allocs/op

This change adds a transmap field to nfaBuffers that is reused across
NFA traversals instead of allocating a new one each time. The transmap
is reset before each use to clear the previous state while preserving
the underlying map capacity.
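
For illustration, here is a minimal sketch of that reuse-and-reset shape. The type and field names are made up for the example, not the actual ones in nfa.go:

```go
// Illustrative only: keep the map alive across traversals and clear it
// in place, so Go retains the map's allocated buckets instead of
// building a fresh map every time.
type traversalBufs struct {
	seen map[uint64]bool // reused across traversals
}

func (b *traversalBufs) reset() {
	if b.seen == nil {
		b.seen = make(map[uint64]bool)
		return
	}
	clear(b.seen) // Go 1.21+: empties the map but keeps its backing storage
}
```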
This benchmark exercises NFA traversal code paths with:
- 16 English letter patterns (A*, B*, C*, etc.)
- 4 complex wildcard patterns (*E*E*E*, *A*B*, etc.)
- 5 CJK patterns (Japanese, Chinese, Korean with wildcards)
- 4 Emoji patterns (🎉, 🚀, ❤️, 🌟 with wildcards)
- 32 test events across all character types

This benchmark is useful for measuring allocation patterns and memory
usage in shellstyle/NFA matching workloads. Memory profiling shows
this PR reduces total allocations by ~7% and transmap.all allocations
by ~19% compared to main.
Makes the code more consistent by using map iteration for funky,
CJK, and emoji patterns instead of individual AddPattern calls.
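
As a rough illustration of the shape of such a benchmark using Quamina's public API (the pattern strings, field names, and event below are placeholders, not the ones actually in BenchmarkShellstyleMultiMatch):

```go
// sketch_test.go, illustrative only; the real benchmark lives inside the
// quamina package and uses different patterns and events.
package sketch

import (
	"testing"

	"quamina.net/go/quamina"
)

func BenchmarkShellstyleSketch(b *testing.B) {
	patterns := map[string]string{
		"letters": `{"name": [{"shellstyle": "A*"}]}`,
		"funky":   `{"name": [{"shellstyle": "*A*B*"}]}`,
		"cjk":     `{"name": [{"shellstyle": "日本*"}]}`,
		"emoji":   `{"name": [{"shellstyle": "*🎉*"}]}`,
	}
	q, err := quamina.New()
	if err != nil {
		b.Fatal(err)
	}
	// Add patterns by iterating the map rather than one call per literal.
	for label, pat := range patterns {
		if err := q.AddPattern(label, pat); err != nil {
			b.Fatal(err)
		}
	}
	event := []byte(`{"name": "ABBA 🎉"}`)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := q.MatchesForEvent(event); err != nil {
			b.Fatal(err)
		}
	}
}
```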
sayrer changed the title from "Pr/pool transmap buffer" to "Pool transmap buffer" on Jan 24, 2026

sayrer (Author) commented Jan 24, 2026

BenchmarkNumberMatching didn't use transMap at all; it was just allocating nfaBuffers too much. This new benchmark shows the allocation reductions.

timbray (Owner) commented Jan 26, 2026

Sorry, maxed out on $OTHER_PROJECT for a day or two; will get back to this, and thanks for the work. BTW, the first thing I'll do when I take a serious look is run that Benchmark8259Example on a before/after; my intuition is that it applies a lot of useful stress to the right parts of the code.

sayrer (Author) commented Jan 26, 2026

No worries, I just do these with coffee in the morning and sort out the good ones (hopefully). There's no time pressure.

timbray (Owner) commented Jan 26, 2026

Glad to hear it. BTW, think you could set yourself up for signed commits? Any old SSH key you have lying around will do; it's not rocket science, and more and more repos are asking for it. Since I know you personally I'm willing to do the extra acknowledgment that we're violating repo policy here, but still.

sayrer (Author) commented Jan 26, 2026

I fixed that. See #479 for example. This one just has some old commits.

timbray (Owner) commented Jan 27, 2026

Haven't looked at the code yet, but…

Before:

go test -bench=^Benchmark8259Example$ -run=^$   
117290/sec
Benchmark8259Example-12    	  149448	      8526 ns/op	     632 B/op	      22 allocs/op
120027/sec
Benchmark8259Example-12    	  148406	      8331 ns/op	     634 B/op	      22 allocs/op

After:

111116/sec
Benchmark8259Example-12    	  142170	      9000 ns/op	     647 B/op	      22 allocs/op
109111/sec
Benchmark8259Example-12    	  136897	      9165 ns/op	     658 B/op	      22 allocs/op

sayrer (Author) commented Jan 27, 2026

Well, that's not good. I'll see if that's explainable tomorrow. The allocations shouldn't be the same--that's the puzzling part. So, even if this one is wrong, we should know the reason.

timbray (Owner) commented Jan 27, 2026

Possibly I messed up the switching back and forth between branches? But I think I know what's going on here: thinking about sync.Pool reminded me why it doesn't work that well with Quamina. Go doesn't have thread-local variables, but Quamina does; check out the README section about Concurrency. The concurrency primitives behind sync.Whatever() are not free, so Quamina can afford to do more allocations because it doesn't have to worry about concurrency. If I'd known about sync.Pool when we were first implementing that part of Quamina, we probably wouldn't have bothered with the non-idiomatic quamina.Copy(), but maybe we came out ahead.

Or maybe I'm missing something.
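
For context, a minimal sketch of the per-goroutine pattern the README's Concurrency section describes (background illustration, not code from this PR; the pattern and event JSON are made up):

```go
package main

import (
	"log"
	"sync"

	"quamina.net/go/quamina"
)

func main() {
	q, err := quamina.New()
	if err != nil {
		log.Fatal(err)
	}
	if err := q.AddPattern("cheese-likers", `{"likes": ["cheese"]}`); err != nil {
		log.Fatal(err)
	}
	event := []byte(`{"likes": "cheese"}`)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		// Each goroutine matches against its own copy, so the hot path
		// needs no locks, channels, or sync.Pool coordination.
		go func(qc *quamina.Quamina) {
			defer wg.Done()
			if _, err := qc.MatchesForEvent(event); err != nil {
				log.Println(err)
			}
		}(q.Copy())
	}
	wg.Wait()
}
```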

BTW does Claude do escape analysis?

sayrer (Author) commented Jan 27, 2026

> BTW does Claude do escape analysis?

In a limited sense, yes. It will sometimes come up with fast optimizations, but then realize they aren't feasible because the caller owns the return value. It will also say "if you want to make this any faster, you must allocate once in the caller and pass in a slice rather than allocating" (which isn't idiomatic for this code; that would be more like a video game, and it does say it's a bad idea unless it's really important).
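
A tiny, generic illustration of that "allocate once in the caller and pass in a slice" idea (the function and names here are hypothetical, not from this PR):

```go
// findOffsets is a made-up example: the caller owns the slice and passes
// it in, so repeated calls can reuse one backing array instead of
// allocating a result slice per call.
func findOffsets(dst []int, data []byte, target byte) []int {
	for i, c := range data {
		if c == target {
			dst = append(dst, i)
		}
	}
	return dst
}

// Typical reuse by a caller:
//
//	buf := make([]int, 0, 64)
//	for _, chunk := range chunks {
//		buf = findOffsets(buf[:0], chunk, 'x')
//		// ... use buf before the next iteration overwrites it
//	}
```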

sayrer (Author) commented Jan 27, 2026

Key differences (negative = branch is faster):

| Function | Flat Δ | Cum Δ | Notes |
|---|---|---|---|
| traverseNFA | -0.16s | -0.04s | NFA traversal improved |
| epsilonClosure.getClosure | -0.08s | -0.33s | Significant improvement |
| mapaccess2_fast64 | -0.11s | -0.18s | Fewer map lookups |
| mapIterStart | -0.03s | -0.16s | Less map iteration overhead |
| greyobject | -0.06s | -0.17s | Less GC pressure |
| runtime.madvise | +0.48s | +0.48s | More memory management |
| runtime.memclrNoHeapPointers | +0.18s | +0.18s | Pool buffer clearing |

The pool optimizations are reducing work in epsilonClosure.getClosure (-0.33s cumulative) and map operations, but adding some memory management overhead. The net effect is roughly neutral to slightly positive on this benchmark.

transitionsBuf and resultBuf are always used.

timbray (Owner) commented Jan 27, 2026

What benchmark is producing these delta numbers? I looked at the code and there's nothing I hate, but I'm still not seeing any noticeable effect on my new fave 8259Example benchmark.

sayrer (Author) commented Jan 27, 2026

Using the one you want, but this one does seem to vary depending on the -benchtime setting. The lazy fields are valid, I think. This test happens to exercise all of them, but not all patterns will. The biggest win is the last commit. The numbers above were with the sync.Pool, but that should be gone now.

Before:

%  go test -bench=Benchmark8259Example -benchmem -benchtime=5s -run=^$    
FA: Field matchers: 2 (avg size 2.500, max 4)
Value matchers: 5
SmallTables 20371 (splices 6, avg 4.033, max 66, epsilons avg 0.001, max 2) singletons 1
97528/sec
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
Benchmark8259Example-20    	  620136	     10253 ns/op	     622 B/op	      22 allocs/op

After:

%  go test -bench=Benchmark8259Example -benchmem -benchtime=5s -run=^$
FA: Field matchers: 2 (avg size 2.500, max 4)
Value matchers: 5
SmallTables 20371 (splices 6, avg 4.033, max 66, epsilons avg 0.001, max 2) singletons 1
100207/sec
goos: darwin
goarch: arm64
pkg: quamina.net/go/quamina
cpu: Apple M1 Ultra
Benchmark8259Example-20    	  653206	      9979 ns/op	     513 B/op	      17 allocs/op

sayrer (Author) commented Jan 27, 2026

Top allocators:

| Function | Alloc | % | Notes |
|---|---|---|---|
| epsilonClosure.getClosure | 157.24MB | 33% | Growing the closures map |
| transmap.all | 101.50MB | 22% | Down from 114MB before |
| addRuneTreeEntry | 79.57MB | 17% | Pattern setup (one-time) |
| traverseNFA | 66.50MB | 14% | Various allocations |
| storeArrayElementField | 24.50MB | 5% | JSON flattening |
| numbits.toQNumber | 19.50MB | 4% | Number conversion |

The transmap.all dropped from 114MB to 101.50MB (~11% reduction). The main remaining allocators are:

  1. epsilonClosure.getClosure (33%) - the closure cache growing
  2. transmap.all (22%) - still allocating a slice per call
  3. traverseNFA (14%) - various internal allocations

timbray (Owner) commented Jan 27, 2026

OK, we're now seeing the same thing.

Switched to branch 'main'
Your branch is up to date with 'origin/main'.
* main
114350/sec
Benchmark8259Example-12    	  147391	      8745 ns/op	     636 B/op	      22 allocs/op
116699/sec
Benchmark8259Example-12    	  141386	      8569 ns/op	     648 B/op	      22 allocs/op
117325/sec
Benchmark8259Example-12    	  146750	      8523 ns/op	     638 B/op	      22 allocs/op
118020/sec
Benchmark8259Example-12    	  148444	      8473 ns/op	     634 B/op	      22 allocs/op
117820/sec
Benchmark8259Example-12    	  150262	      8488 ns/op	     631 B/op	      22 allocs/op
Switched to branch 'pr/pool-transmap-buffer'
* pr/pool-transmap-buffer
118947/sec
Benchmark8259Example-12    	  151880	      8407 ns/op	     532 B/op	      17 allocs/op
116849/sec
Benchmark8259Example-12    	  149766	      8558 ns/op	     536 B/op	      17 allocs/op
118552/sec
Benchmark8259Example-12    	  146882	      8435 ns/op	     541 B/op	      17 allocs/op
117942/sec
Benchmark8259Example-12    	  152149	      8479 ns/op	     531 B/op	      17 allocs/op
120095/sec
Benchmark8259Example-12    	  153506	      8327 ns/op	     529 B/op	      17 allocs/op

Good stuff!

Will now have a closer look at the code.

timbray (Owner) left a comment

Couple of minor comments; that aside, I'm happy with this.

Except for: At this point, nfaBufs is becoming an important part of the puzzle. I wonder if it deserves to be pulled out of nfa.go and given its own file with a bit of commentary at the top explaining it. I spend a lot of time looking at nfa.go and it'd be nice to have simple supporting stuff elsewhere to not get in the way. Possible variation: Create nfa_support.go and shuffle nfaBufs and transMap off into that.

sayrer (Author) commented Jan 27, 2026

> Couple of minor comments; that aside, I'm happy with this.
>
> Except for: At this point, nfaBufs is becoming an important part of the puzzle. I wonder if it deserves to be pulled out of nfa.go and given its own file with a bit of commentary at the top explaining it. I spend a lot of time looking at nfa.go and it'd be nice to have simple supporting stuff elsewhere to not get in the way. Possible variation: Create nfa_support.go and shuffle nfaBufs and transMap off into that.

I can see that, but let me suggest getting the current PRs merged first, before we adjust nfa.go for livability. There are some merge conflicts between this one and #482. After the three patches here are in, that would be the better time to refactor a little.

timbray merged commit 89b3739 into timbray:main on Jan 27, 2026
7 checks passed
sayrer deleted the pr/pool-transmap-buffer branch on January 27, 2026