Split BeatV2Manager Start into two methods, so Beats can reply to check-in in parallel to its initialisation#49796
Split BeatV2Manager Start into two methods, so Beats can reply to check-in in parallel to its initialisation#49796belimawr wants to merge 17 commits intoelastic:mainfrom
BeatV2Manager Start into two methods, so Beats can reply to check-in in parallel to its initialisation#49796Conversation
Split Start into PreStart and PostStart so the manager can start replying to check-in before the Beat is ready to start/stop inputs/output.
Ensure the timer in unitListen always fires until the Beat is ready and the first config is applied.
🤖 GitHub commentsJust comment with:
|
GenAI-Assisted: Yes Human-Reviewed: Yes Tool: Cursor-CLI, Model: GPT-5.3 Codex Extra High Fast
Add the 'WaitForStop' method in the V2 manager to wait all goroutines to exit. This prevents panics in test when the `testing.T` is used as the manager's logger's output.
Keep the irregular input active until the test observes DEGRADED and shorten the per-state timeout to avoid long hangs when transitions are missed. Made-with: Cursor
TL;DR
Remediation
Investigation detailsRoot CauseBuildkite fails in In the PR branch (
Evidence
Local reproduction on Verification
Follow-upAfter committing the regenerated Note 🔒 Integrity filtering filtered 1 itemIntegrity filtering activated and filtered the following item during workflow execution.
What is this? | From workflow: PR Buildkite Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
BeatV2Manager into two methods, so Beats can reply to check-in in parallel to its initialisation
There was a problem hiding this comment.
Pull request overview
This PR restructures the V2 management manager startup so Beats can begin responding to Elastic Agent check-ins earlier (before full Beat initialization), avoiding check-in timeouts and resulting restart/crash loops.
Changes:
- Split
BeatV2Manager.Start()intoPreInit()(start client + check-ins) andPostInit()(mark running + enable applying unit changes), keepingStart()as a deprecated compatibility wrapper. - Start Filebeat’s manager earlier in
Filebeat.Run()and add shutdown waiting support viaWaitForStop()(primarily for tests). - Update integration/unit tests and mock managers to align with the new lifecycle.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| x-pack/libbeat/management/managerV2.go | Implements PreInit/PostInit, adds goroutine waiting via WaitForStop, adjusts unit application timing. |
| libbeat/management/management.go | Extends Manager interface with PreInit, PostInit, and WaitForStop; deprecates Start behaviorally. |
| filebeat/beater/filebeat.go | Moves manager startup earlier (pre state-store load) and calls PostInit after initialization completes. |
| x-pack/libbeat/management/managerV2_test.go | Adds tests for buffered units behavior and uses WaitForStop to ensure clean shutdown. |
| x-pack/filebeat/tests/integration/status_reporter_test.go | Tweaks integration test scenario expectations/inputs and lint annotations. |
| x-pack/filebeat/tests/integration/managerV2_test.go | Updates integration expectations around unit states/messages. |
| x-pack/otel/otelmanager/manager.go | Updates stub manager to satisfy the expanded Manager interface. |
| x-pack/osquerybeat/beater/osquerybeat_status_test.go | Updates test manager stub + minor assertion/lint improvements. |
| libbeat/cmd/instance/beat_test.go | Updates test mock manager to satisfy new interface. |
| libbeat/beat/beat_test.go | Updates test manager to satisfy new interface. |
| changelog/fragments/1774984035-move-manager-start.yaml | Adds a changelog fragment describing the Filebeat crash-loop fix. |
Comments suppressed due to low confidence (1)
x-pack/libbeat/management/managerV2.go:304
PreInitdrops the underlying error fromcm.client.Start(ctx), returning a generic message only. This makes failures hard to diagnose; please wrap/return the original error (e.g., with%w) and include context in the message.
ctx := context.Background()
err := cm.client.Start(ctx)
if err != nil {
return fmt.Errorf("error starting connection to client")
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
📝 WalkthroughWalkthroughThis pull request refactors the 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@x-pack/filebeat/tests/integration/managerV2_test.go`:
- Line 286: The test incorrectly expects proto.State_HEALTHY for an input that
the rest of the test and the counterpart managerV2_test.go treat as remaining in
STARTING after a failed output reload; update the expectation(s) to match the
intended failed-output path by replacing proto.State_HEALTHY with
proto.State_STARTING (also fix the duplicate occurrences noted around the same
test at the other location referenced, lines 332-334) or alternatively adjust
the surrounding test setup and assertions so the input is explicitly
transitioned to HEALTHY everywhere (ensure consistency between the State field
in the failing-output scenario and the assertions in managerV2_test.go).
In `@x-pack/libbeat/management/managerV2_test.go`:
- Around line 455-472: The test flips beatReady too late, causing CheckinV2Impl
to assert units are STARTING while m.PostInit() may already trigger check-ins;
fix by setting beatReady to true before calling m.PostInit (or alternatively
relax the CheckinV2Impl assertion to only enforce STARTING when beatReady is
false and not before m.PostInit runs). Update the test to call
beatReady.Store(true) (or adjust the CheckinV2Impl logic around beatReady) so
there is no timing window between m.PostInit() and the assertion in
CheckinV2Impl.
In `@x-pack/libbeat/management/managerV2.go`:
- Around line 373-380: The timer created by time.NewTimer(timeout) (variable t)
is not stopped when the done channel fires; modify the select branch that
handles <-done to call t.Stop() and, to be safe, drain the timer if Stop returns
false (i.e., the timer already fired) before returning true. Update the select
so the case <-done path invokes t.Stop() (and drains t.C when needed) to release
timer resources and then returns true.
- Around line 94-95: stopWait is declared as sync.WaitGroup but the code calls
stopWait.Go(), which doesn't exist; change stopWait to an errgroup.Group and
update its usage: replace the declaration "stopWait sync.WaitGroup" with
"stopWait errgroup.Group" (import "golang.org/x/sync/errgroup"), change the
stopWait.Go(...) calls in PreInit() to stopWait.Go(...) using the errgroup
signature, and replace any stopWait.Wait() calls with stopWait.Wait() from
errgroup (handling returned error) or, alternatively, if you prefer
sync.WaitGroup keep the current calls but replace stopWait.Go(...) with
stopWait.Add(1) followed by launching a goroutine that defers stopWait.Done()
inside PreInit().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 55e41946-b5dc-4634-bfc5-1b6debdc5722
📒 Files selected for processing (11)
changelog/fragments/1774984035-move-manager-start.yamlfilebeat/beater/filebeat.golibbeat/beat/beat_test.golibbeat/cmd/instance/beat_test.golibbeat/management/management.gox-pack/filebeat/tests/integration/managerV2_test.gox-pack/filebeat/tests/integration/status_reporter_test.gox-pack/libbeat/management/managerV2.gox-pack/libbeat/management/managerV2_test.gox-pack/osquerybeat/beater/osquerybeat_status_test.gox-pack/otel/otelmanager/manager.go
| t := time.NewTimer(timeout) | ||
|
|
||
| select { | ||
| case <-done: | ||
| return true | ||
| case <-t.C: | ||
| return false | ||
| } |
There was a problem hiding this comment.
Timer not stopped on early return.
When done fires before the timeout, t.Stop() should be called to release timer resources.
Proposed fix
select {
case <-done:
+ t.Stop()
return true
case <-t.C:
return false
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| t := time.NewTimer(timeout) | |
| select { | |
| case <-done: | |
| return true | |
| case <-t.C: | |
| return false | |
| } | |
| t := time.NewTimer(timeout) | |
| select { | |
| case <-done: | |
| t.Stop() | |
| return true | |
| case <-t.C: | |
| return false | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@x-pack/libbeat/management/managerV2.go` around lines 373 - 380, The timer
created by time.NewTimer(timeout) (variable t) is not stopped when the done
channel fires; modify the select branch that handles <-done to call t.Stop()
and, to be safe, drain the timer if Stop returns false (i.e., the timer already
fired) before returning true. Update the select so the case <-done path invokes
t.Stop() (and drains t.C when needed) to release timer resources and then
returns true.
There was a problem hiding this comment.
Are you sure? Check the documentation for the Go version used by Beats.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the Go version used by Beats
cat .go-version 2>/dev/null || true
grep -r '^go ' go.mod | head -5
Length of output: 16
@belimawr, you're right — my suggestion was incorrect. Beats uses Go 1.25.8, and since Go 1.23 the behaviour changed significantly:
- As of Go 1.23, the garbage collector can recover unreferenced timers, even if they haven't expired or been stopped. The
Stopmethod is no longer necessary to help the garbage collector. - Timer channels are now synchronous (unbuffered), giving
t.Resetandt.Stopa stronger guarantee. Before Go 1.23, it was impossible to avoid stale values witht.Reset, and avoiding stale values witht.Stoprequired careful use of its return value. The Go 1.23 implementation removes this concern entirely.
The code in WaitForStop is correct as-is — no t.Stop() call is needed. I apologize for the noise.
✏️ Learnings added
Learnt from: belimawr
Repo: elastic/beats PR: 49796
File: x-pack/libbeat/management/managerV2.go:373-380
Timestamp: 2026-03-31T20:42:16.412Z
Learning: In Go 1.23+, time.Timer instances that are no longer referenced are garbage collected automatically without needing t.Stop(). Timer channels are also now synchronous (unbuffered), eliminating stale value concerns after t.Reset or t.Stop. Therefore, do not flag missing t.Stop() calls on time.NewTimer as resource leaks in Go 1.23+ codebases. Beats (elastic/beats) uses Go 1.25.8.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: ShourieG
Repo: elastic/beats PR: 48846
File: x-pack/filebeat/input/akamai/config.go:231-297
Timestamp: 2026-03-16T11:05:51.163Z
Learning: When reviewing Go code that uses cfg.Unpack(&dst) from the elastic-agent-libs config wrapper, understand that Unpack will call Validate() on dst if it implements the Validator interface and will enforce validate struct tags (e.g., validate:"required"). Do not require or flag an explicit dst.Validate() call after Unpack in factory methods like InputManager.Create(); rely on the upstream validation behavior. If a destination type should be validated, ensure it implements Validator or uses appropriate validate tags.
Learnt from: cmacknz
Repo: elastic/beats PR: 49796
File: x-pack/libbeat/management/managerV2.go:94-95
Timestamp: 2026-03-31T20:23:25.614Z
Learning: In Go 1.25+, `sync.WaitGroup` includes a `Go(f func())` method that runs `f` in a new goroutine and automatically performs the corresponding `Add(1)`/`Done()` bookkeeping. Therefore, when reviewing Go code, do not flag calls of the form `wg.Go(...)` as incorrect WaitGroup usage; they are valid and should be treated as correct Go 1.25+ code. (See `sync.WaitGroup.Go` in the Go standard library.)
There was a problem hiding this comment.
This thing was clearly trained on an older version of Go than we use (1.23 apparently).
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
BeatV2Manager into two methods, so Beats can reply to check-in in parallel to its initialisationBeatV2Manager Start into two methods, so Beats can reply to check-in in parallel to its initialisation
|
|
||
| // Once ready, previously buffered unit state should eventually be applied. | ||
| m.PostInit() | ||
| beatReady.Store(true) |
There was a problem hiding this comment.
Am I following this right that beatReady is what is gating the unit config even being sent to the manager, and not the Pre or Post init?
There was a problem hiding this comment.
Why is this boolean necessary, don't you want checkins to happen continuously regardless of if the PostInit has been called?
There was a problem hiding this comment.
Nop, beatReady only ensures we assert that all units are in STARTING until PostInit is called.
There was a problem hiding this comment.
OK yes it's on the observed not the expected.
Should beatReady.Store(true) get changed before m.PostInit() changes, so that a checking doens't happen between the two lines before the condition gets changed?
There was a problem hiding this comment.
beatReady exists so the test can ensure the Beat (well, the manager) and all units are in STARTING state util PostInit is called.
Even if the Beat is not applying the Units, it needs to reply the checkin, so the test ensures all units in the checkin response are in the STARTING state.
There was a problem hiding this comment.
Yes that I see, my last point is that once you call m.PostInit() the status can flip to something other than starting before you can call beatReady.Store(true). That is a very slim possibility.
cmacknz
left a comment
There was a problem hiding this comment.
Outstanding comments are basically nits, thanks
Thanks for the quick review Craig! I believe I've replied all of your comments. |
|
Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform) |
Proposed commit message
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesstresstest.shscript to run them under stress conditions and race detector to verify their stability../changelog/fragmentsusing the changelog tool.## Disruptive User Impact## Author's ChecklistHow to test this PR locally
Because this PR is a implementation detail change, there is no directly observable behaviour change. The best way to test is to run the new and existing tests.
Run the new test
Run the tests from the modified packages
Related issues
Manager.Start()before state store loading to prevent check-in timeout livelock #49512## Use cases## Screenshots## Logs