Fix packetbeat cache janitor goroutine leak #48836
Conversation
Protocol plugins (dns, tcp, mysql, pgsql, mongodb, thrift, amqp, nfs/rpc, icmp) start cache janitor goroutines via StartJanitor() but never call StopJanitor() when the plugin is destroyed during a configuration reload. Each reload cycle leaks two goroutines (one from the DNS/protocol cache, one from the TCP stream cache) and their associated map allocations (~3 MB per cache with the default 64k-slot hash size). Under Fleet management, where policy revisions trigger frequent runner restarts, this causes unbounded memory growth. Add a PluginCloser interface and Close() methods to all protocol plugins that use caches, and call them from the sniffer cleanup path so janitor goroutines are stopped when the sniffer is torn down.
The TransactionPublisher.worker goroutine exits when p.done is closed during Stop(), but never calls client.Close() on the beat.Client it holds. Each configuration reload creates a new client via CreateReporter() that is never released, leaking pipeline client resources. Add defer client.Close() to the worker so clients are properly released when the publisher stops.
StopJanitor closed the janitorQuit channel but never nilled it out, so a second call would panic on closing an already-closed channel. Nil the channel after close so repeated calls are safe.
This pull request does not have a backport label.
To fix up this pull request, you need to add the backport labels for the needed branches.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)
leehinman left a comment:
In general, adding the Closer LGTM.
Any chance you could add a test that uses go.uber.org/goleak? That has helped us catch and track down leaks with beatreceiver startup and shutdown.
beats/x-pack/otel/oteltest/oteltest.go
Lines 209 to 224 in 1cbe74e
done 209a469
d8e78e2 to 25bb5a8
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Refactors janitor lifecycle management: libbeat cache uses synchronized quit channels to start/stop janitors. Adds a PluginCloser interface and ProtocolsStruct.Close. Many packetbeat protocol plugins (AMQP, DNS, ICMP, MongoDB, MySQL, NFS, PostgreSQL, TCP, Thrift) gain Close methods to stop their janitors. Sniffer and decoder code now propagate and invoke cleanup functions when decoders change. Multiple handlers are hardened with safe type assertions and payload/size guards. New unit tests verify decoder and janitor cleanup behavior.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packetbeat/protos/thrift/thrift.go`:
- Around line 234-236: The Close path currently only calls
thrift.transactions.StopJanitor() but doesn't stop the goroutine started in
thrift.init() that runs thrift.publishTransactions() and ranges over
thrift.publishQueue; modify thriftPlugin to signal and wait for that publisher
to terminate: add a shutdown mechanism (e.g., a done channel or a sync.WaitGroup
plus a closed/closing flag) to the thriftPlugin struct, have Close()
signal/close that channel (or set the flag and close publishQueue in a
coordinated way) and then wait for the publishTransactions goroutine to exit;
ensure producers check the closing flag or observe the closed channel before
sending to thrift.publishQueue so sends cannot occur after Close() begins, and
update thrift.init(), publishTransactions(), and Close() to use this
coordination.
📒 Files selected for processing (18)
- changelog/fragments/1771343134-packetbeat-libbeat-janitor-cleanup-lifecycle-fix.yaml
- libbeat/common/cache.go
- packetbeat/protos/amqp/amqp.go
- packetbeat/protos/dns/dns.go
- packetbeat/protos/icmp/icmp.go
- packetbeat/protos/icmp/icmp_test.go
- packetbeat/protos/mongodb/mongodb.go
- packetbeat/protos/mysql/mysql.go
- packetbeat/protos/nfs/rpc.go
- packetbeat/protos/pgsql/pgsql.go
- packetbeat/protos/protos.go
- packetbeat/protos/tcp/tcp.go
- packetbeat/protos/thrift/thrift.go
- packetbeat/publish/publish.go
- packetbeat/sniffer/decoders.go
- packetbeat/sniffer/decoders_test.go
- packetbeat/sniffer/sniffer.go
- packetbeat/sniffer/sniffer_test.go
💤 Files with no reviewable changes (1)
- packetbeat/protos/icmp/icmp_test.go
Close the thrift publish queue in Close so publishTransactions can exit cleanly and avoid a lingering goroutine after shutdown. Made-with: Cursor
Whoops, I gave it a final look and realized it needs to be reworked; I'll fix it on Monday. The protocol janitors were closed prematurely and we'd leak memory, and the thrift part has a panic risk.
protocols.Close() was being called from per-decoder cleanup, but the protocols instance outlives the decoder — it is created once per interface in setupSniffer and reused across decoder rebuilds on link-type changes. Calling Close() on decoder replacement could stop protocol janitors while the analyzers were still in use, and in the case of Thrift could panic on double channel close. Move protocols.Close() to run once after Sniffer.Run() exits, matching the actual lifetime of the protocols object. Decoder-level cleanup (ICMP, TCP, UDP) remains per-decoder as before.
Fixed the lifetimes - now protocol janitors are cleaned up along with the sniffers.
@Mergifyio backport 8.19 9.2 9.3 |
✅ Backports have been created
Cherry-pick of 70b37f2 has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
* packetbeat: stop cache janitor goroutines on protocol teardown
* packetbeat: close pipeline clients when publisher worker exits
* libbeat: make Cache.StopJanitor idempotent
* packetbeat: add goroutine leak regression test for decoder cleanup
* packetbeat: fix janitor leaks and decoder lifecycle on dynamic interface changes
* changelog: add fragment for janitor and decoder cleanup fixes
* changelog: fixup fragment
* packetbeat: fix golangci-lint findings in touched files
* packetbeat: more linter fixes
* filebeat: fix AD memberOf filter test expectations
* packetbeat: Add robust synchronization for Start/StopJanitor
* packetbeat/thrift: stop publisher goroutine on shutdown
* packetbeat: move protocols.Close() to sniffer shutdown

(cherry picked from commit 70b37f2)

Conflicts:
- packetbeat/beater/processor.go
- packetbeat/sniffer/sniffer.go
Proposed commit message
Checklist
- stresstest.sh script to run them under stress conditions and race detector to verify their stability.
- ./changelog/fragments using the changelog tool.

Disruptive User Impact
Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs