@hawkingrei
Member

@hawkingrei hawkingrei commented Dec 25, 2025

What problem does this PR solve?

Issue Number: close #65818

Problem Summary:

Analyze does not propagate cancellation context into RPC/NextRaw; killing the query can leave analyze workers blocked.

What changed and how does it work?

  • Pass SQLKiller-derived context into analyze workers and all V1 analyze column paths.
  • Replace context.TODO() with the propagated ctx in analyze V1 build/consume flow.
  • Add a failpoint-gated test to ensure analyze exits promptly after ctx cancellation.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot added the do-not-merge/needs-linked-issue, do-not-merge/needs-tests-checked, release-note-none, and size/S labels on Dec 25, 2025
@hawkingrei
Member Author

/retest

@codecov

codecov bot commented Dec 25, 2025

Codecov Report

❌ Patch coverage is 72.88136% with 112 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.1469%. Comparing base (5de6f55) to head (ed61b41).
⚠️ Report is 30 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #65249        +/-   ##
================================================
+ Coverage   77.7888%   78.1469%   +0.3580%     
================================================
  Files          2000       1922        -78     
  Lines        545038     535993      -9045     
================================================
- Hits         423979     418862      -5117     
+ Misses       119397     116660      -2737     
+ Partials       1662        471      -1191     
Flag Coverage Δ
integration 44.1556% <36.8038%> (-4.0150%) ⬇️
unit 76.4101% <72.8813%> (-0.0186%) ⬇️

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.8228% <ø> (-12.1487%) ⬇️

@hawkingrei changed the title from "executor: fix analyze cannot be killed" to "[WIP]executor: fix analyze cannot be killed" on Dec 25, 2025
@ti-chi-bot added the do-not-merge/work-in-progress label on Dec 25, 2025
@hawkingrei
Member Author

/retest

@ti-chi-bot added the size/L label and removed the size/S label on Dec 31, 2025
@hawkingrei changed the title from "[WIP]executor: fix analyze cannot be killed" to "executor: fix analyze cannot be killed" on Jan 26, 2026
@ti-chi-bot removed the do-not-merge/work-in-progress and do-not-merge/needs-tests-checked labels on Jan 26, 2026
@hawkingrei hawkingrei force-pushed the fix_cannot_kill_analyze branch from e93e052 to 253f0d7 Compare January 26, 2026 08:52
Copilot AI review requested due to automatic review settings January 26, 2026 08:52
Copilot AI left a comment

Pull request overview

This PR aims to make ANALYZE responsive to query kill/cancellation by propagating a SQLKiller-derived cancellation context into analyze workers and DistSQL/NextRaw paths, and adds a failpoint-based test to validate prompt exit on context cancellation.

Changes:

  • Add a SQLKiller-provided cancelable context (GetKillEventCtx) and propagate it through analyze execution paths.
  • Replace context.TODO() with a propagated ctx in several analyze V1/V2 build/consume flows and add ctx-aware worker loops.
  • Add a failpoint-gated unit test ensuring ANALYZE exits quickly after ctx cancellation.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Summary per file:

File                                           Description
pkg/util/sqlkiller/sqlkiller.go                Introduces kill-event context support for cancellation propagation.
pkg/util/sqlkiller/BUILD.bazel                 Adds dependency needed for new sqlkiller error usage.
pkg/executor/analyze.go                        Threads kill-derived ctx into analyze workers.
pkg/executor/analyze_col.go                    Propagates ctx into analyze V1 build/NextRaw flow.
pkg/executor/analyze_col_v2.go                 Adds ctx plumbing/cancellation to analyze V2 sampling workers and NextRaw usage.
pkg/executor/analyze_idx.go                    Adds ctx parameters through index analyze call chain and cancellation checks.
pkg/distsql/distsql.go                         Adds failpoint to block until ctx cancellation for testing.
pkg/executor/test/analyzetest/analyze_test.go  Adds unit test validating analyze cancellation behavior.

@ti-chi-bot added the size/XL label and removed the size/L label on Jan 26, 2026
Copilot AI review requested due to automatic review settings January 26, 2026 10:16
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

        statsHandle.FinishAnalyzeJob(results.Job, nil, statistics.TableAnalysisJob)
        totalResult.results[results.Ars[0].Hist[0].ID] = results
    case <-ctx.Done():
        err = ctx.Err()
Copilot AI Jan 26, 2026

On cancellation, this path sets err = ctx.Err(), which loses the cancellation cause from SQLKiller.GetKillEventCtx (set via cancelFn(errKilled)). Consider using context.Cause(ctx) (falling back to ctx.Err()) so the caller can differentiate SQL kill vs plain context cancellation.

Suggested change
- err = ctx.Err()
+ err = context.Cause(ctx)
+ if err == nil {
+     err = ctx.Err()
+ }

Copilot AI review requested due to automatic review settings January 26, 2026 13:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings January 26, 2026 13:54
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comment on lines 651 to 658
// Unmarshal the data.
dataSize := int64(cap(data))
colResp := &tipb.AnalyzeColumnsResp{}
err := colResp.Unmarshal(data)
if err != nil {
    resultCh <- &samplingMergeResult{err: err}
    return
}
Copilot AI Jan 26, 2026

subMergeWorker returns early on Unmarshal error and sends the error to resultCh, but it doesn't clean up retCollector before returning. This can leak collector resources/memory on decode errors. Consider destroying/returning retCollector to the pool before returning in these early-exit error cases.

failpoint.Inject("mockAnalyzeRequestWaitForCancel", func(val failpoint.Value) {
    if val.(bool) {
        <-ctx.Done()
        failpoint.Return(nil, ctx.Err())
Copilot AI Jan 26, 2026

The failpoint 'mockAnalyzeRequestWaitForCancel' returns ctx.Err(), which discards any cancellation cause set via context.WithCancelCause (e.g. SQLKiller’s specific interrupt error). Returning context.Cause(ctx) (falling back to ctx.Err() if nil) would preserve the intended kill reason in tests and callers using cancel causes.

Suggested change
- failpoint.Return(nil, ctx.Err())
+ err := context.Cause(ctx)
+ if err == nil {
+     err = ctx.Err()
+ }
+ failpoint.Return(nil, err)

Comment on lines 683 to 698
case <-ctx.Done():
    err := context.Cause(ctx)
    if err != nil {
        resultCh <- &samplingMergeResult{err: err}
        return
    }
    err = ctx.Err()
    if err != nil {
        resultCh <- &samplingMergeResult{err: err}
        return
    }
    if intest.InTest {
        panic("this ctx should be canceled with the error")
    }
    resultCh <- &samplingMergeResult{err: errors.New("context canceled without error")}
    return
Copilot AI Jan 26, 2026

In subMergeWorker, the ctx.Done() branch returns after sending an error to resultCh, but the local retCollector is never DestroyAndPutToPool()'d (and any memory it accumulated is never released). This can leak collector objects/memory on cancellation. Ensure retCollector is cleaned up before returning on ctx cancellation (and similarly for other early-return error paths).

@ti-chi-bot

ti-chi-bot bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cfzjywxk, terry1purcell for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.


Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment on lines 495 to 518
LOOP:
for panicCnt < samplingStatsConcurrency {
results, ok := <-resultsCh
if !ok {
break
}
if results.Err != nil {
err = results.Err
statsHandle.FinishAnalyzeJob(results.Job, err, statistics.TableAnalysisJob)
if isAnalyzeWorkerPanic(err) {
panicCnt++
select {
case results, ok := <-resultsCh:
if !ok {
break LOOP
}
continue
if results.Err != nil {
err = results.Err
statsHandle.FinishAnalyzeJob(results.Job, err, statistics.TableAnalysisJob)
if isAnalyzeWorkerPanic(err) {
panicCnt++
}
continue LOOP
}
statsHandle.FinishAnalyzeJob(results.Job, nil, statistics.TableAnalysisJob)
totalResult.results[results.Ars[0].Hist[0].ID] = results
case <-ctx.Done():
err = context.Cause(ctx)
if err == nil {
err = ctx.Err()
}
break LOOP
}
Copilot AI Jan 29, 2026

In handleNDVForSpecialIndexes, the new <-ctx.Done() branch breaks out of the result-consumption loop without finishing all sub-index analyze jobs. Because jobs are inserted up-front (AddNewAnalyzeJob) and only finished when their result is read, an early break can leave rows in mysql.analyze_jobs stuck in pending/running for the remaining tasks. Consider removing the early break and letting the loop drain resultsCh until it is closed (the workers should return quickly because analyzeIndexNDVPushDown now uses the same ctx), or explicitly finishing any remaining jobs with the ctx error before returning.

}
}
}()
return ctx, func() {
Copilot AI Jan 29, 2026

buildAnalyzeKillCtx creates a cancelable context but the returned stop() only closes stopCh; it never calls the context's cancel function. Calling cancel (e.g., cancel(nil)) in stop() helps release context resources and ensures any ctx-derived work that might still be observing ctx.Done() can terminate promptly if stop() is invoked.

Suggested change
- return ctx, func() {
+ return ctx, func() {
+     cancel(nil)

@ti-chi-bot

ti-chi-bot bot commented Jan 29, 2026

@hawkingrei: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen ed61b41 link true /test pull-unit-test-next-gen


Labels

component/statistics, release-note-none, sig/planner, size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Analyze cannot be cancelled promptly

1 participant