More allocation avoidance. #470
Yeah, this one is a tougher sell. A few notes:

Maybe a separate PR with that benchmark program first?

I think the Quamina code does allocate too much, but …
Also, totally willing to separate these PRs. But sometimes this kind of stuff needs to be together, so that things start fitting in caches; then four seemingly ineffective optimizations all of a sudden work.
Well, the underlying benchmark decompresses and loads 219M into memory, so yeah, that's a lot of allocation, which needs to be disentangled from the Quamina `MatchesForEvent()` code. Anyhow, will have a look at your benchmark. Obviously, it's real-world clock time that matters.
It's not a bad thing, fwiw. It means the Quamina code is faster than the surrounding code, so it needs to be isolated in the benchmark.
Will have a look at that memory profile later today. Might add a CPU profile too.
Hmm, I was trying to compare the runtimes of your branch and my "issue 66" PR branch by running your `perf.go` program, and was getting basically identical numbers. Then I saw that it imports `quamina.net/go/quamina`; I don't know enough about the Go "import" mechanism, but is it possible it's fetching the same library code on both sides?

I think that is probably the issue. I'll look at the import paths and then make the thing slower to check that it's importing the correct library.
Huh, can't figure out the incantation to make the memory profiler ignore `getCl2Lines` and just profile `MatchesForEvent` & descendants. The CPU profiler reveals … no surprises. This particular benchmark really stresses `FlattenJSON` because the field values are big and the benchmarks use most of them. I'm going to have to buckle down and invest some time into building an actually useful benchmarker that lets the profiler cast light on what's going on in Quamina's own code.
While I get this patch in order, which may take some time: I have seen this pattern before. It's spending a lot of CPU time in the allocator.
Yeah, it's complicated. The pattern of memory usage varies a lot depending on what kind of patterns you're looking for and what kind of records you're scanning, and is often going to be dominated by the flattener code. The patterns that can run in DFA mode (everything but "wildcard" and "regexp") are really hard to improve at this point; they've been profiled and optimized a lot. The ones that need NFA mode have both automaton-creation and automaton-traversal ugliness, and haven't been really well profiled yet. Once again, the biggest win would be eliminating epsilon transitions; if Claude could do that I'd be super impressed.
I went at this for a little bit, but the result was that the tests measured data loading more than Quamina. So I looked at it from a few different angles, and found the same thing every time.

Not sure what you're saying here. Does this latest version provide a significant performance increase?
The results I consistently found were that these changes were faster, but that the benchmarks measure too much non-Quamina code; as in, code that Quamina doesn't even call into (benchmark setup, data loading, etc.). The profiler said, after the latest patch, "this is the fastest you can make it in idiomatic Go", and I agree. We can let this one sit if you don't mind; it's not harmful. I was working on a similar task in another project, and I suspect that the benchmarks are the thing that needs work first (as you said above).
Running that perf program and inspecting the generated files, it definitely makes a difference (you need Graphviz to render the pprof graphs). Firstly, the benchmark is spending 65% of the program loading that big file, leaving only a few hundred ms of code to benchmark. If you look at those pprof graphs with and without the patches here, there is a huge difference in the actual Quamina code. It is tough to discern because of noise from loading that file and pprof itself, but there is a long callgraph under the traverse functions that is just gone.
Having slept on this, I filed #471: when we're considering optimizations, we shouldn't have to dig through the entrails of the profiler output; the effects should be straightforwardly visible. So I'm going to leave this PR on ice until I (or someone else) gets a better benchmark in place. I don't want to close the PR, though, because I think it'll be useful in benchmarking proposed benchmarks.
OK, I added a simple (but I think quite useful) benchmark that I'm pretty sure focuses on the interesting part of the problem. Why don't you try running this PR against that?

I'll try it when I get a chance (probably in a day or so). I think the benchmark in #473 looks good, but the input needs to be randomized in a loop to insert at least CJK characters and emoji. (This is because you get into the really gnarly parts of the Unicode lookup tables, and it's easy to optimize for ASCII if you don't do this, but you really want to hit the variable-length byte cases of UTF-8.)
1. **Removed slab allocator** from `epsilonClosure` - it was causing more allocations (this was only on this branch, not on main)
2. **Pooled match results** - added `resultBuf` to `nfaBuffers` with a `matchSet.matchesInto()` method
3. **Pooled DFA transitions** - added a `dfaTransitions` buffer to `nfaBuffers` for `traverseDFA()` in the matching hot path (not for `traverseNFA`, since its results are returned to callers)
4. **Fixed buffer initialization** - ensured `buf1` and `buf2` are cleared at the start of `traverseNFA`

| Metric     | Before       | After        | Improvement |
|------------|--------------|--------------|-------------|
| Bytes/op   | 993 B/op     | 666 B/op     | -33%        |
| Allocs/op  | 31 allocs/op | 22 allocs/op | -29%        |
| Throughput | ~97k/sec     | ~101k/sec    | +4%         |
This benchmark is still so fast that the throughput is a little noisy, but this summary is about what I got, and the Bytes and Alloc numbers are consistent both on main and this branch. |
timbray left a comment:
A little dubious about changing a bunch of things at once, but I guess I can live with this. Do please glance at comments.
```diff
 func (m *coreMatcher) matchesForJSONWithFlattener(event []byte, f Flattener) ([]X, error) {
 	fields, _ := f.Flatten(event, m.getSegmentsTreeTracker())
-	return m.matchesForFields(fields, newNfaBuffers())
+	bufs := m.bufferPool.Get().(*nfaBuffers)
```
OK I guess, but this func is only called in benchmark scenarios. Glance at Quamina.MatchesForEvent for the production path. The higher level routine is much smarter about keeping the nfaBufs around so I'd like to migrate our benchmarks to that.
And, I lied. It's mostly called in testing scenarios.
```go
	matches        *matchSet
	transmap       *transmap
	resultBuf      []X
	dfaTransitions []*fieldMatcher
```
A little nervous, on general engineering principles, about changing a bunch of things at once. I wonder which of these actually introduce the measured benefits? Probably not a blocker, I guess. See comments about `sync.Pool` above.
Ah OK, I can break them up if you want. They were all tested individually, but they're all pretty much the same idea. As in, I did one after another.
If it's not too much work, I'd sure appreciate that. Will probably end up taking all/most of them, but still.
```go
type coreMatcher struct {
	updateable atomic.Pointer[coreFields]
	lock       sync.Mutex
	bufferPool sync.Pool
```
Dubious about this one, tried it before and it produced zero measurable change. Probably harmless I guess.
OK, I can do this one last. If it helps, it'll be easier to see at the end.
So you're going to re-spin the PR? Excellent, thanks.
Let's leave this one open for the conversation, I'll hang some smaller ones off this one with a number reference.
This also has the advantage of teaching us things by observing the CPU/memory effects of the tweaks. BTW I'm using GoLand, which has this lovely thing where you can run any test or benchmark with CPU or memory profiling and it draws you a flame graph in the IDE that's full of lessons. Recommended.
Oh, these are all done with pprof. Sometimes graphical visualizations, but often just reading the output of `go tool pprof -top`. Something like

```shell
go test -bench=^Benchmark8259Example$ -run=^$ -cpuprofile=cpu_pr.prof -benchtime=5s
```
I think I'll try some of these in a more constrained environment. That second patch was a huge allocation win, and only a little bit of a CPU win. But I think that's not realistic. Nothing in the queue is a loss, but memory is free for this program on a Mac Studio with 64 GB of RAM (or Tim's M2 MacBook, which seems to be a little faster). This program will be running in container/VM/EC2 instances that are not designed to let you forget about what you have running.
FWIW I am working on a parallel track, namely reviving the neglected nfa2Dfa code. The new benchmark runs just over twice as fast when I squish it into a DFA (this is with your latest patch). However, there are issues. First, nfa2Dfa has a weird corner-case bug I'm chasing. Second, you can't just automatically apply this to every NFA, because for certain NFAs you get into horrible O(2^N) scaling. So it's going to have to be an option and is going to need API design.
Summary of the combined optimizations from #478, #479, #482

Closing this as superseded by all that other work, for which thanks.
I think this patch is valid, and `perf.sh` is doing too much measuring of loading test data. This one will not be an easy rubber-stamp like the other Claude patches. The `perf.sh` numbers don't look like an improvement, but the profiles are much different.

Before:

After:
Comparing the original profile to the final one after optimizations, here is the shift in where CPU time is spent:

**Original Profile (High Churn)**

The profile was dominated by the Go runtime managing memory (allocating, clearing, sweeping), rather than doing useful work.

Interpretation: Nearly 35% of the "flat" CPU time was spent purely on memory mechanics (`madvise`, `memclr`, `malloc`). The actual matching logic (`transitionOn`) was buried.

**Optimized Profile (Efficient Work)**

Now the profile looks much healthier. The top consumers are dominated by test data loading (which is unavoidable in this benchmark) and actual application logic.

Interpretation: While `madvise` is still present (likely due to the massive initial load of `citylots.json` into memory), the raw time spent there dropped from 3.25s to 0.84s. `readObject` and `MatchesForEvent` are now much more prominent relative to the total runtime (excluding the test setup). Per the `go test` output, the CPU is doing far less "administrative" work for every event matched. The allocation-heavy functions like `mallocgcTiny` have effectively disappeared from the top of the profile, confirming that the pooling and slab allocation strategies successfully eliminated the allocation churn.