-
Notifications
You must be signed in to change notification settings - Fork 6.1k
executor: fix analyze cannot be killed #65249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
/retest |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #65249 +/- ##
================================================
+ Coverage 77.7888% 78.1469% +0.3580%
================================================
Files 2000 1922 -78
Lines 545038 535993 -9045
================================================
- Hits 423979 418862 -5117
+ Misses 119397 116660 -2737
+ Partials 1662 471 -1191
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
/retest |
e93e052 to
253f0d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR aims to make ANALYZE responsive to query kill/cancellation by propagating a SQLKiller-derived cancellation context into analyze workers and DistSQL/NextRaw paths, and adds a failpoint-based test to validate prompt exit on context cancellation.
Changes:
- Add a SQLKiller-provided cancelable context (
GetKillEventCtx) and propagate it through analyze execution paths. - Replace
context.TODO()with a propagated ctx in several analyze V1/V2 build/consume flows and add ctx-aware worker loops. - Add a failpoint-gated unit test ensuring
ANALYZEexits quickly after ctx cancellation.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/util/sqlkiller/sqlkiller.go | Introduces kill-event context support for cancellation propagation. |
| pkg/util/sqlkiller/BUILD.bazel | Adds dependency needed for new sqlkiller error usage. |
| pkg/executor/analyze.go | Threads kill-derived ctx into analyze workers. |
| pkg/executor/analyze_col.go | Propagates ctx into analyze V1 build/NextRaw flow. |
| pkg/executor/analyze_col_v2.go | Adds ctx plumbing/cancellation to analyze V2 sampling workers and NextRaw usage. |
| pkg/executor/analyze_idx.go | Adds ctx parameters through index analyze call chain and cancellation checks. |
| pkg/distsql/distsql.go | Adds failpoint to block until ctx cancellation for testing. |
| pkg/executor/test/analyzetest/analyze_test.go | Adds unit test validating analyze cancellation behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
pkg/executor/analyze_col_v2.go
Outdated
| statsHandle.FinishAnalyzeJob(results.Job, nil, statistics.TableAnalysisJob) | ||
| totalResult.results[results.Ars[0].Hist[0].ID] = results | ||
| case <-ctx.Done(): | ||
| err = ctx.Err() |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On cancellation, this path sets err = ctx.Err(), which loses the cancellation cause from SQLKiller.GetKillEventCtx (set via cancelFn(errKilled)). Consider using context.Cause(ctx) (falling back to ctx.Err()) so the caller can differentiate SQL kill vs plain context cancellation.
| err = ctx.Err() | |
| err = context.Cause(ctx) | |
| if err == nil { | |
| err = ctx.Err() | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| // Unmarshal the data. | ||
| dataSize := int64(cap(data)) | ||
| colResp := &tipb.AnalyzeColumnsResp{} | ||
| err := colResp.Unmarshal(data) | ||
| if err != nil { | ||
| resultCh <- &samplingMergeResult{err: err} | ||
| return | ||
| } |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subMergeWorker returns early on Unmarshal error and sends the error to resultCh, but it doesn't clean up retCollector before returning. This can leak collector resources/memory on decode errors. Consider destroying/returning retCollector to the pool before returning in these early-exit error cases.
| failpoint.Inject("mockAnalyzeRequestWaitForCancel", func(val failpoint.Value) { | ||
| if val.(bool) { | ||
| <-ctx.Done() | ||
| failpoint.Return(nil, ctx.Err()) |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failpoint 'mockAnalyzeRequestWaitForCancel' returns ctx.Err(), which discards any cancellation cause set via context.WithCancelCause (e.g. SQLKiller’s specific interrupt error). Returning context.Cause(ctx) (falling back to ctx.Err() if nil) would preserve the intended kill reason in tests and callers using cancel causes.
| failpoint.Return(nil, ctx.Err()) | |
| err := context.Cause(ctx) | |
| if err == nil { | |
| err = ctx.Err() | |
| } | |
| failpoint.Return(nil, err) |
| case <-ctx.Done(): | ||
| err := context.Cause(ctx) | ||
| if err != nil { | ||
| resultCh <- &samplingMergeResult{err: err} | ||
| return | ||
| } | ||
| err = ctx.Err() | ||
| if err != nil { | ||
| resultCh <- &samplingMergeResult{err: err} | ||
| return | ||
| } | ||
| if intest.InTest { | ||
| panic("this ctx should be canceled with the error") | ||
| } | ||
| resultCh <- &samplingMergeResult{err: errors.New("context canceled without error")} | ||
| return |
Copilot
AI
Jan 26, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In subMergeWorker, the ctx.Done() branch returns after sending an error to resultCh, but the local retCollector is never DestroyAndPutToPool()'d (and any memory it accumulated is never released). This can leak collector objects/memory on cancellation. Ensure retCollector is cleaned up before returning on ctx cancellation (and similarly for other early-return error paths).
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| LOOP: | ||
| for panicCnt < samplingStatsConcurrency { | ||
| results, ok := <-resultsCh | ||
| if !ok { | ||
| break | ||
| } | ||
| if results.Err != nil { | ||
| err = results.Err | ||
| statsHandle.FinishAnalyzeJob(results.Job, err, statistics.TableAnalysisJob) | ||
| if isAnalyzeWorkerPanic(err) { | ||
| panicCnt++ | ||
| select { | ||
| case results, ok := <-resultsCh: | ||
| if !ok { | ||
| break LOOP | ||
| } | ||
| continue | ||
| if results.Err != nil { | ||
| err = results.Err | ||
| statsHandle.FinishAnalyzeJob(results.Job, err, statistics.TableAnalysisJob) | ||
| if isAnalyzeWorkerPanic(err) { | ||
| panicCnt++ | ||
| } | ||
| continue LOOP | ||
| } | ||
| statsHandle.FinishAnalyzeJob(results.Job, nil, statistics.TableAnalysisJob) | ||
| totalResult.results[results.Ars[0].Hist[0].ID] = results | ||
| case <-ctx.Done(): | ||
| err = context.Cause(ctx) | ||
| if err == nil { | ||
| err = ctx.Err() | ||
| } | ||
| break LOOP | ||
| } |
Copilot
AI
Jan 29, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In handleNDVForSpecialIndexes, the new <-ctx.Done() branch breaks out of the result-consumption loop without finishing all sub-index analyze jobs. Because jobs are inserted up-front (AddNewAnalyzeJob) and only finished when their result is read, an early break can leave rows in mysql.analyze_jobs stuck in pending/running for the remaining tasks. Consider removing the early break and letting the loop drain resultsCh until it is closed (the workers should return quickly because analyzeIndexNDVPushDown now uses the same ctx), or explicitly finishing any remaining jobs with the ctx error before returning.
| } | ||
| } | ||
| }() | ||
| return ctx, func() { |
Copilot
AI
Jan 29, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buildAnalyzeKillCtx creates a cancelable context but the returned stop() only closes stopCh; it never calls the context's cancel function. Calling cancel (e.g., cancel(nil)) in stop() helps release context resources and ensures any ctx-derived work that might still be observing ctx.Done() can terminate promptly if stop() is invoked.
| return ctx, func() { | |
| return ctx, func() { | |
| cancel(nil) |
|
@hawkingrei: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: close #65818
Problem Summary:
Analyze does not propagate cancellation context into RPC/NextRaw; killing the query can leave analyze workers blocked.
What changed and how does it work?
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.