Pool transmap buffer #478
Conversation
This change adds a transmap field to nfaBuffers that is reused across NFA traversals instead of allocating a new one each time. The transmap is reset before each use to clear the previous state while preserving the underlying map capacity.
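A minimal sketch of the reset-and-reuse idea described above (the type and field names here are illustrative stand-ins, not Quamina's actual internals): Go's built-in `clear()` (Go 1.21+) empties a map while keeping its allocated buckets, so later traversals insert into already-sized storage instead of growing a fresh map.

```go
package main

import "fmt"

// nfaBuffers is a placeholder for the real struct; the key/value
// types are arbitrary for illustration.
type nfaBuffers struct {
	transmap map[int][]int
}

// resetTransmap clears the previous traversal's state but preserves
// the map's underlying capacity, avoiding a fresh allocation per use.
func (b *nfaBuffers) resetTransmap() {
	if b.transmap == nil {
		b.transmap = make(map[int][]int)
		return
	}
	clear(b.transmap) // drop all entries, keep the buckets
}

func main() {
	b := &nfaBuffers{}
	for traversal := 0; traversal < 3; traversal++ {
		b.resetTransmap()
		b.transmap[traversal] = []int{traversal}
	}
	fmt.Println(len(b.transmap))
}
```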
This benchmark exercises NFA traversal code paths with:
- 16 English letter patterns (A*, B*, C*, etc.)
- 4 complex wildcard patterns (*E*E*E*, *A*B*, etc.)
- 5 CJK patterns (Japanese, Chinese, Korean with wildcards)
- 4 Emoji patterns (🎉, 🚀, ❤️, 🌟 with wildcards)
- 32 test events across all character types

This benchmark is useful for measuring allocation patterns and memory usage in shellstyle/NFA matching workloads. Memory profiling shows this PR reduces total allocations by ~7% and transmap.all allocations by ~19% compared to main.
Makes the code more consistent by using map iteration for funky, CJK, and emoji patterns instead of individual AddPattern calls.
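The "map iteration" style mentioned above might look something like the following sketch; `addAll` and the pattern strings are hypothetical, and in the real benchmark the loop body would presumably call Quamina's `AddPattern` for each entry.

```go
package main

import "fmt"

// addAll registers every pattern in one loop instead of repeating an
// AddPattern call per pattern. The body is a stand-in; in the real
// code it would be something like q.AddPattern(name, p).
func addAll(patterns map[string]string) (int, error) {
	added := 0
	for name, p := range patterns {
		_ = name // placeholder for q.AddPattern(name, p)
		_ = p
		added++
	}
	return added, nil
}

func main() {
	patterns := map[string]string{
		"funky": `{"x": [{"shellstyle": "*E*E*E*"}]}`,
		"cjk":   `{"x": [{"shellstyle": "*語*"}]}`,
		"emoji": `{"x": [{"shellstyle": "*🎉*"}]}`,
	}
	n, _ := addAll(patterns)
	fmt.Println(n)
}
```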
|
BenchmarkNumberMatching didn't use transMap at all; it was just allocating nfaBuffers too much. This new benchmark shows allocation reductions. |
|
Sorry, maxed out on $OTHER_PROJECT for a day or two, will get back to this and thanks for the work. BTW the first thing I'll do when I take a serious look is run that |
|
No worries, I just do these with coffee in the morning and sort out the good ones (hopefully). There's no time pressure. |
|
Glad to hear it. BTW, think you could set yourself up for signed commits? Any old ssh key you have lying around will do, and it's not rocket science, and more and more repos are asking for it. Since I know you personally I'm willing to do the extra acknowledgment that we're violating repo policy here, but still. |
|
I fixed that. See #479 for example. This one just has some old commits. |
|
Haven't looked at the code yet, but… Before: After: |
|
Well, that's not good. I'll see if that's explainable tomorrow. The allocations shouldn't be the same--that's the puzzling part. So, even if this one is wrong, we should know the reason. |
|
Possibly I messed up the switching back and forth between branches? But I think I know what's going on here: thinking about sync.Pool reminded me why it doesn't work that well with Quamina.

Go doesn't have thread-local variables, but Quamina does; check out the README section about Concurrency. The concurrency primitives behind sync.Whatever() are not free, so Quamina can afford to do more allocations because it doesn't have to worry about concurrency. If I'd known about sync.Pool() when we were first implementing that part of Quamina, we probably wouldn't have bothered with the non-idiomatic

Or maybe I'm missing something. BTW does Claude do escape analysis? |
In a limited sense, yes. It will sometimes come up with fast optimizations, but then realize they are not feasible because the caller owns the return value. It will also say "if you want to make this any faster, you must allocate once in the caller and pass in a slice rather than allocating." That's not idiomatic for this code — it would be like a video game — and it does say it's a bad idea unless it's really important.
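The sync.Pool tradeoff discussed above can be sketched roughly as follows (all names are illustrative, not Quamina's): a `sync.Pool` is safe for concurrent use but pays for that safety with synchronization on every `Get`/`Put`, while a buffer held as a plain field on a per-goroutine matcher copy needs no synchronization at all.

```go
package main

import (
	"fmt"
	"sync"
)

// buffers is a stand-in for a reusable scratch struct.
type buffers struct{ scratch []int }

var pool = sync.Pool{
	New: func() any { return &buffers{scratch: make([]int, 0, 64)} },
}

// withPool borrows from a shared pool: concurrency-safe, but Get/Put
// involve per-P caches and atomics under the hood.
func withPool() int {
	b := pool.Get().(*buffers)
	defer pool.Put(b)
	b.scratch = b.scratch[:0]
	return cap(b.scratch)
}

// matcher owns its buffers outright; if each copy is used by a single
// goroutine (as Quamina's concurrency model describes), no locking or
// atomics are needed at all.
type matcher struct{ bufs buffers }

func (m *matcher) withField() int {
	if cap(m.bufs.scratch) == 0 {
		m.bufs.scratch = make([]int, 0, 64) // one-time allocation
	}
	m.bufs.scratch = m.bufs.scratch[:0]
	return cap(m.bufs.scratch)
}

func main() {
	m := &matcher{}
	fmt.Println(withPool(), m.withField())
}
```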
|
Key differences (negative = branch is faster):
The pool optimizations are reducing work in |
transitionsBuf and resultBuf are always used.
|
What benchmark is producing these delta numbers? I looked at the code and there's nothing I hate but I'm still not seeing any noticeable effect on my new fave 8259Example benchmark. |
|
Using the one you want, but the numbers do seem to vary depending on the -benchtime setting. The lazy fields are valid, I think: this test happens to exercise all of them, but not all patterns will. The biggest win is the last commit. The numbers above are with the sync.Pool, but that should be gone now. Before: After: |
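The "lazy fields" idea mentioned above might be sketched like this (field names are placeholders, not Quamina's actual ones): buffers that every traversal touches are allocated eagerly, while ones that only some patterns need stay nil until first use, so patterns that never need them never pay for them.

```go
package main

import "fmt"

// nfaBuffers mixes eager and lazy buffers; types are arbitrary
// stand-ins for illustration.
type nfaBuffers struct {
	transitionsBuf []int       // always used: allocate up front
	resultBuf      []int       // always used
	transmap       map[int]int // lazy: only some patterns need it
}

func newNfaBuffers() *nfaBuffers {
	return &nfaBuffers{
		transitionsBuf: make([]int, 0, 16),
		resultBuf:      make([]int, 0, 16),
		// transmap deliberately left nil
	}
}

// getTransmap allocates the map on first use only.
func (b *nfaBuffers) getTransmap() map[int]int {
	if b.transmap == nil {
		b.transmap = make(map[int]int)
	}
	return b.transmap
}

func main() {
	b := newNfaBuffers()
	fmt.Println(b.transmap == nil) // nothing allocated yet
	b.getTransmap()[1] = 1
	fmt.Println(len(b.transmap))
}
```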
|
Top allocators:
The
|
|
OK, we're now seeing the same thing. Good stuff! Will now have a closer look at the code. |
timbray
left a comment
Couple of minor comments; that aside, I'm happy with this.
Except for: At this point, nfaBufs is becoming an important part of the puzzle. I wonder if it deserves to be pulled out of nfa.go and given its own file with a bit of commentary at the top explaining it. I spend a lot of time looking at nfa.go and it'd be nice to have simple supporting stuff elsewhere to not get in the way. Possible variation: Create nfa_support.go and shuffle nfaBufs and transMap off into that.
I can see that, but let me suggest getting the current PRs merged first, before we adjust |
This one was written for TestRulerCl2, but didn't show up there in wall-clock time because the savings weren't evident relative to the allocation of the test data. The savings were there in the pprof file. It turns out you can see it in BenchmarkNumberMatching quite clearly.

This adds the pool to coreMatcher that got a meh in #470, because the lack of it was distorting the benchmarks. If you add these cached buffers, making a new nfaBuffers struct becomes more expensive. So, not having that in coreMatcher was distorting the cost (and the high-level API already does this optimization).
This patch was a regression (14 allocs/op) before I changed coreMatcher.
This one also adds a new test, BenchmarkShellstyleMultiMatch, that I used while profiling. I figured it was worth keeping.
Before:
After:
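The "pool in coreMatcher" described above might look roughly like this sketch (all names are illustrative stand-ins for Quamina's internals): once nfaBuffers carries cached buffers, constructing a fresh one per match is expensive, so the matcher reuses them and just resets each one before handing it out.

```go
package main

import (
	"fmt"
	"sync"
)

// nfaBuffers is a stand-in carrying the cached buffers that make
// fresh construction expensive.
type nfaBuffers struct {
	transitionsBuf []int
	resultBuf      []int
}

// coreMatcher keeps a pool of nfaBuffers rather than building one
// per match call.
type coreMatcher struct {
	bufPool sync.Pool
}

func newCoreMatcher() *coreMatcher {
	m := &coreMatcher{}
	m.bufPool.New = func() any {
		return &nfaBuffers{
			transitionsBuf: make([]int, 0, 32),
			resultBuf:      make([]int, 0, 32),
		}
	}
	return m
}

// matchesForEvent borrows buffers, resets them (keeping capacity),
// and returns them when done. The body is a placeholder for real
// matching logic.
func (m *coreMatcher) matchesForEvent(event []byte) int {
	bufs := m.bufPool.Get().(*nfaBuffers)
	defer m.bufPool.Put(bufs)
	bufs.transitionsBuf = bufs.transitionsBuf[:0]
	bufs.resultBuf = bufs.resultBuf[:0]
	return len(event) // stand-in result
}

func main() {
	m := newCoreMatcher()
	fmt.Println(m.matchesForEvent([]byte("hello")))
}
```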