
Export AutoTuner inputs to output and fix GPU OOM detection #2071

Merged
parthosa merged 17 commits into NVIDIA:dev from parthosa:rapids-tools-2060
Apr 27, 2026

Conversation

@parthosa
Collaborator

@parthosa parthosa commented Apr 8, 2026

Fixes #2060

Changes

1. New app_level_recommendation_signals.csv

Per-app output (profiling and qualification) with a single wide row of derived signals that feed recommendation engines (AutoTuner, qualx, etc.). GPU-only signals are 0 for qualification (CPU event logs).

| Column | Meaning |
| --- | --- |
| appId | application ID |
| numScanStagesWithGpuOom | count of scan stages where failed tasks had GPU OOM (profiling only) |
| numGpuShuffleStagesWithContainerOom | count of GPU shuffle stages where the YARN container was killed (ExecutorLostFailure + exit 137) (profiling only) |

Example output (profiling on a GPU event log):

appId,numScanStagesWithGpuOom,numGpuShuffleStagesWithContainerOom
"app-20250305192829-0000",3,0

Both OOM counts require cross-referencing failed_tasks.csv × failed_stages.csv × scan-time/GpuShuffleExchange markers — not derivable from any single existing CSV, which is why they live in a dedicated file. Layout is wide (single row) to match the project convention for per-app CSVs; per-SQL and per-stage breakdowns are left as a follow-up if/when additional signals are added.
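The cross-reference behind the counts can be sketched roughly as below; `FailedTask`, the parameter names, and the predicate wiring are illustrative stand-ins, not the tool's actual classes:

```scala
// Hypothetical sketch of the cross-reference behind numScanStagesWithGpuOom:
// stage IDs of GPU-OOM failed tasks intersected with the scan-stage IDs.
// FailedTask and all names here are illustrative.
case class FailedTask(stageId: Long, failureReason: String)

def numScanStagesWithGpuOom(
    failedTasks: Seq[FailedTask],
    scanStageIds: Set[Long],
    isGpuOom: String => Boolean): Int = {
  val oomStages = failedTasks
    .filter(t => isGpuOom(t.failureReason)) // keep only GPU-OOM failures
    .map(_.stageId)
    .toSet
  oomStages.intersect(scanStageIds).size   // the count written to the CSV
}
```

The shuffle/container-OOM count follows the same intersect-and-count shape with a different stage marker set.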

2. Add input_bytesRead_max to aggregated task-metrics CSVs

Based on offline discussion with @hirakendu, moved input bytes read to sql/stage/job level aggregated task metrics.

New input_bytesRead_max column appended to stage_level_aggregated_task_metrics.csv, sql_level_aggregated_task_metrics.csv, and job_level_aggregated_task_metrics.csv, alongside the existing _max columns (duration_max, peakExecutionMemory_max, resultSize_max). Computed inside TaskMetricsAccumRec in the same pass as the _sum aggregates.
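The same-pass max tracking can be sketched as below; the class is a minimal stand-in for the pattern used in TaskMetricsAccumRec, not its real interface:

```scala
// Minimal sketch of the single-pass *_max accumulation pattern: the max is
// updated alongside the existing sums as each task record is visited.
// Field and method names are illustrative.
class TaskMetricsAccum {
  var inputBytesReadSum: Long = 0L
  var inputBytesReadMax: Long = 0L

  def addRecord(inputBytesRead: Long): Unit = {
    inputBytesReadSum += inputBytesRead
    inputBytesReadMax = math.max(inputBytesReadMax, inputBytesRead)
  }
}
```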

File: rapids_4_spark_profile/app-xxxx-0000/sql_level_aggregated_task_metrics.csv

appID,sqlID,description,numTasks,Duration,diskBytesSpilled_sum,...,input_bytesRead_max,...
"app-20250305192829-0000",24,"query9",353,97639,15594,...,1484282882,...

File: rapids_4_spark_profile/app-xxxx-0000/stage_level_aggregated_task_metrics.csv

stageId,numTasks,Duration,...,input_bytesRead_max,...
35,200,75095,...,1484282882,...

AutoTuner's getMaxInput now derives the value from sqlTaskAggMetrics.map(_.inputBytesReadMax).max instead of the dedicated SQLMaxTaskInputSizes record — which has been deleted along with maxTaskInputSizeBytesPerSQL() and the corresponding fields on AggRawMetricsResult / ApplicationSummaryInfo.
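A rough sketch of the new derivation (the record type and the empty guard are illustrative; the actual code maps over sqlTaskAggMetrics):

```scala
// Sketch of getMaxInput after the change: take the max of the per-SQL
// inputBytesReadMax aggregates instead of a dedicated record.
// SqlAgg is a stand-in for the real aggregate result type.
case class SqlAgg(sqlId: Long, inputBytesReadMax: Long)

def getMaxInput(sqlTaskAggMetrics: Seq[SqlAgg]): Long =
  if (sqlTaskAggMetrics.isEmpty) 0L // guard for apps with no SQL aggregates
  else sqlTaskAggMetrics.map(_.inputBytesReadMax).max
```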

Since max columnar exchange bytes is a SQL plan/operator metric, it will be handled in #2020.

3. shuffle_skew_check.csv for qualification

Qualification computed taskShuffleSkew in rawAggMetrics but never wrote it. It is now emitted under the same schema profiling already uses; aggRawMetrics is computed once and reused.

4. Fix GPU OOM detection

Previous detection missed pre-24.02 JNI logs that use com.nvidia.spark.rapids.jni.SplitAndRetryOOM (no Gpu prefix). SparkRapidsOomExceptions.isGpuOom() now matches:

Set("GpuSplitAndRetryOOM", "GpuRetryOOM", "jni.GpuOOM",
    "jni.SplitAndRetryOOM", "jni.RetryOOM")

The jni. prefix avoids matching CpuSplitAndRetryOOM/CpuRetryOOM; the jni.GpuOOM anchor catches the base class without matching arbitrary GpuOOM* substrings.
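A substring-match sketch of the behavior described (the real SparkRapidsOomExceptions.isGpuOom may differ in shape):

```scala
// Sketch of the GPU-OOM matching rules: substring match of the failure
// reason against the anchored class-name fragments listed above.
val gpuOomFragments = Set("GpuSplitAndRetryOOM", "GpuRetryOOM", "jni.GpuOOM",
  "jni.SplitAndRetryOOM", "jni.RetryOOM")

def isGpuOom(failureReason: String): Boolean =
  gpuOomFragments.exists(failureReason.contains)
```

Note that `jni.CpuSplitAndRetryOOM` contains none of the five fragments, while both the current `Gpu`-prefixed names and the pre-24.02 `jni.`-anchored names match.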

5. AppInfoGpuOomCheck trait: Boolean → Set[Long]

hasScanStagesWithGpuOom: Boolean → scanStagesWithGpuOom: Set[Long] and hasShuffleStagesWithOom: Boolean → gpuShuffleStagesWithContainerOom: Set[Long]. AutoTuner uses .nonEmpty; the CSV reports the per-app counts derived from these sets. The gpuShuffle…ContainerOom naming makes the GPU-only gating explicit and distinguishes container-level (YARN SIGKILL) OOM from device-level GPU OOM. Shared compute* helpers moved to the SingleAppSummaryInfoProvider companion for reuse.
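The migrated trait shape might look like this sketch (`hasAnyGpuOomSignal` is an illustrative helper, not a real method):

```scala
// Illustrative sketch of the trait after the Boolean -> Set[Long] migration.
trait AppInfoGpuOomCheck {
  def scanStagesWithGpuOom: Set[Long]             // was hasScanStagesWithGpuOom: Boolean
  def gpuShuffleStagesWithContainerOom: Set[Long] // was hasShuffleStagesWithOom: Boolean

  // AutoTuner-style gating recovers the old Booleans via .nonEmpty,
  // while reporting can still see the richer per-stage sets.
  def hasAnyGpuOomSignal: Boolean =
    scanStagesWithGpuOom.nonEmpty || gpuShuffleStagesWithContainerOom.nonEmpty
}
```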

Not changing

  • maxColumnarExchangeDataSizeBytes is no longer exposed as a CSV signal, but the in-memory provider method and AutoTuner's AQE partition adjustment (AutoTuner.scala) are unchanged — the AutoTuner reads it from in-memory sqlMetrics directly.

Testing

  • New OomDetectionSuite — table-driven tests covering current JNI classes, pre-24.02 classes, CPU OOM exclusion, and non-OOM failures.
  • BaseAutoTunerSuite mock updated to Set[Long] and renamed field.
  • ProfilingAutoTunerSuite / V2 call sites migrated from Boolean to Set[Long].
  • ApplicationInfoSuite expected CSV file counts bumped by 1 for the new app_level_recommendation_signals.csv.

@github-actions github-actions Bot added the core_tools Scope the core module (scala) label Apr 8, 2026
@parthosa parthosa force-pushed the rapids-tools-2060 branch from 861f674 to 0968cee on April 8, 2026 17:07
@parthosa parthosa self-assigned this Apr 8, 2026
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa force-pushed the rapids-tools-2060 branch from 0968cee to a3d9d75 on April 8, 2026 17:22
@parthosa parthosa marked this pull request as ready for review April 8, 2026 18:16
@greptile-apps

greptile-apps Bot commented Apr 8, 2026

Greptile Summary

This PR extends the profiling and qualification output with a new app_level_recommendation_signals.csv (GPU OOM stage counts), adds input_bytesRead_max to all three aggregated task-metrics CSVs (replacing the dedicated SQLMaxTaskInputSizes record and its separate pass over tasks), broadens GPU OOM detection to include pre-24.02 JNI class names (jni.SplitAndRetryOOM, jni.RetryOOM, jni.GpuOOM), and migrates AutoTuner's Boolean OOM flags to Set[Long] for richer signal data.

Confidence Score: 5/5

Safe to merge — no functional bugs found; only a P2 documentation mismatch between the PR description's stated CSV format (stage IDs) and the actual output (counts).

All P0/P1 concerns from previous review threads were addressed (jni.GpuOOM anchoring, Long dataType in YAML, Set[Long] migration). The inputBytesReadMax accumulation is correct and consistent with the existing max-field pattern. The getMaxInput semantic change is equivalent. The only finding is P2: the app_level_recommendation_signals.csv stores OOM stage counts rather than the comma-separated stage ID lists described in the PR description.

core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala — AppLevelRecommendationSignalsProfileResult stores counts vs. stage IDs as described.

Important Files Changed

  • core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala: Adds AppLevelRecommendationSignalsProfileResult (stores OOM counts, not stage IDs as described), expands BaseJobStageAggTaskMetricsProfileResult with inputBytesReadMax, removes SQLMaxTaskInputSizes, and extends SparkRapidsOomExceptions with jni.* prefixed class names and an isGpuOom() helper.
  • core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationSummaryInfo.scala: Migrates hasScanStagesWithGpuOom/hasShuffleStagesWithOom from Boolean to Set[Long], removes maxTaskInputBytesRead from ApplicationSummaryInfo, moves compute helpers to the SingleAppSummaryInfoProvider companion object, and redirects getMaxInput to sqlTaskAggMetrics.inputBytesReadMax.
  • core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/util/TaskMetricsAccumRec.scala: Adds the inputBytesReadMax field (initialized to Long.MinValue, reset to 0 in resetFields(), accumulated via math.max in addRecord/addAccumResult), consistent with the existing durationMax/peakExecutionMemoryMax/resultSizeMax pattern.
  • core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala: Threads inputBytesReadMax from accumulators into the JobAgg/StageAgg/SQLAgg result records; removes maxTaskInputSizeBytesPerSQL(), which iterated tasks a second time.
  • core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala: Eagerly computes GPU OOM stage sets for the tuning-signals CSV, stores them in ApplicationSummaryInfo.appLevelRecommendationSignals, and removes the maxTaskInputInfo conditional path.
  • core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala: Migrates OOM guards from Boolean flag calls to Set.nonEmpty checks; no logic change in partition/shuffle-partition tuning recommendations.
  • core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala: Hoists the aggRawMetrics computation for reuse, adds the APP_LEVEL_RECOMMENDATION_SIGNALS CSV with zeroed signals for qualification (CPU-only logs).
  • core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/OomDetectionSuite.scala: New table-driven test suite covering all five GPU OOM class-name patterns, including pre-24.02 jni.* variants, CPU OOM exclusion, and non-OOM failures.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Event Log] --> B[Profiler.processApp]
    B --> C[CollectInformation]
    B --> D[AppSparkMetricsAnalyzer]
    B --> E[HealthCheck]

    C --> F[stageMetrics]
    E --> G[failedTasks / failedStages]
    D --> H[TaskMetricsAccumRec +inputBytesReadMax]
    H --> I[SQLTaskAggMetricsProfileResult +inputBytesReadMax]
    H --> J[StageAggTaskMetricsProfileResult +inputBytesReadMax]
    H --> K[JobAggTaskMetricsProfileResult +inputBytesReadMax]

    F --> L[computeScanStagesWithGpuOom]
    G --> L
    F --> M[computeShuffleStagesWithContainerOom]
    G --> M

    L --> N[AppLevelRecommendationSignals stores counts only]
    M --> N
    N --> O[app_level_recommendation_signals.csv]

    L --> P[scanStagesWithGpuOom Set]
    M --> Q[gpuShuffleStagesWithContainerOom Set]
    P --> R[AutoTuner .nonEmpty checks]
    Q --> R

    I --> S[getMaxInput = sqlAggs.map.inputBytesReadMax.max]
    S --> R


Comment thread on core/src/main/resources/configs/reports/coreRawMetricsReport.yaml (outdated):
- Changed data type of `maxTaskInputBytesRead` from Double to Long in `coreRawMetricsReport.yaml`.
- Refactored methods in `ApplicationSummaryInfo.scala` to read pre-computed values for GPU OOM error handling, enhancing performance and reducing duplicate computations.
- Updated comments in `ProfileClassWarehouse.scala` for clarity on GPU OOM exception handling.

These changes aim to improve the accuracy and efficiency of profiling metrics related to GPU memory management.
@greptile-apps

greptile-apps Bot commented Apr 8, 2026


@parthosa parthosa changed the title from "Enhance profiling metrics and fix GPU OOM detection in AutoTuner" to "Export AutoTuner inputs to output and fix GPU OOM detection" Apr 9, 2026
@parthosa
Collaborator Author

Aligned with @amahussein offline, will create a separate file for these.

@parthosa parthosa marked this pull request as draft April 13, 2026 16:13
parthosa and others added 5 commits April 22, 2026 02:48
Move 4 AutoTuner input fields (maxTaskInputBytesRead,
maxColumnarExchangeDataSizeBytes, scanStagesWithGpuOom,
shuffleStagesWithOom) from application_information.csv into a new
app_tuning_metrics.csv file. These are tuning-specific signals that
don't belong in the core application info table.

Also addresses review feedback:
- Anchor GpuOOM pattern as jni.GpuOOM to avoid partial matches
- Change maxTaskInputBytesRead YAML dataType from Double to Long

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
…to 0

- Rename file from app_tuning_metrics.csv to application_tuning_metrics.csv
  (constant APP_TUNING_METRICS -> APPLICATION_TUNING_METRICS).
- Change maxTaskInputBytesRead and maxColumnarExchangeDataSizeBytes in
  AppTuningMetricsProfileResult to Long with default 0, so missing values
  render as "0" symmetrically instead of mixing empty strings and numbers.
- Add application_tuning_metrics.csv to the per-app output tree comment
  in coreRawMetricsReport.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
…ReportGenerator.scala to enhance code clarity and maintainability.
@parthosa parthosa marked this pull request as ready for review April 22, 2026 15:17
parthosa and others added 2 commits April 22, 2026 08:27
Keep the original Scallop default (false) to avoid changing CLI behavior.
Qualification always writes CSVs via ProfileOutputWriter(outputCSV = true)
so application_tuning_metrics.csv is still produced there unconditionally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
File is unchanged functionally vs dev; skip the auto-copyrighter hook
to keep the copyright year at 2025 since this PR does not modify the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa marked this pull request as draft April 22, 2026 16:11
parthosa and others added 2 commits April 22, 2026 11:02
- Rename output file application_tuning_metrics.csv -> tuning_signals.csv
  and switch to vertical metricName,value layout (matches spark_properties.csv).
  appId is dropped since the file lives in a per-app directory.
- Replace AppTuningMetricsProfileResult with a row-shaped TuningSignalProfileResult
  plus a builder that emits one row per metric. Simplifies adding new signals.
- Rename trait method shuffleStagesWithOom -> gpuShuffleStagesWithContainerOom
  to make the GPU-only gating explicit and distinguish container-level
  (YARN SIGKILL) OOM from device-level GPU OOM.
- Drop Gpu prefix on the static compute helper
  (computeShuffleStagesWithContainerOom) since it lives in the
  profiling-only SingleAppSummaryInfoProvider companion.
- Update YAML schema, OutHeaderRegistry, Profiler/QualRawReportGenerator
  call sites, AutoTuner consumer, and test mocks accordingly.
- ToolsAPI auto-discovers the new coreRawTuningSignalsCSV label via the
  YAML; no Python changes needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Shorter header matches the spark_properties.csv (propertyName, propertyValue)
pattern but with 'name' since the containing CSV already conveys the
'metric/signal' context via its filename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa marked this pull request as ready for review April 22, 2026 18:22
amahussein
amahussein previously approved these changes Apr 22, 2026
Collaborator

@amahussein amahussein left a comment


LGTM!
Nice feature to add for the tools output.

…s.csv

- Add input_bytesRead_max column to stage/SQL/job aggregated task metrics
  CSVs alongside existing _max columns (duration_max, peakExecutionMemory_max,
  resultSize_max). Tracked in TaskMetricsAccumRec and flows through
  StageAggAccum / SQLAggAccum / JobAggAccum to the three result case classes.
- Drop maxTaskInputBytesRead and maxColumnarExchangeDataSizeBytes rows from
  tuning_signals.csv. The file now contains only the two GPU OOM signals.
- Remove SQLMaxTaskInputSizes case class, maxTaskInputSizeBytesPerSQL()
  helper, and the maxTaskInputSizes field from AggRawMetricsResult and
  ApplicationSummaryInfo — consumers (getMaxInput on both providers) now
  derive the value from sqlTaskAggMetrics/sqlAggs via inputBytesReadMax.
- AutoTuner's AQE ColumnarExchange adjustment is unchanged — it reads
  from in-memory sqlMetrics via SingleAppSummaryInfoProvider, not from CSV.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa marked this pull request as draft April 22, 2026 23:19
Updated the 9 expectation CSVs under ProfilingExpectations/ to include
the new inputBytesReadMax column produced by the stage/SQL/job task
metric aggregates. Values were regenerated by running AnalysisSuite
against the existing event logs and writing the actual DataFrames back
to the expectation files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Files touched by the tuning-signals relocation change had outdated
copyright headers that the CI header check flagged as expired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa
Collaborator Author

parthosa commented Apr 22, 2026

Based on offline discussion with @hirakendu, moved input bytes read to sql/stage/job level aggregated task metrics.

File: rapids_4_spark_profile/app-xxxx-0000/sql_level_aggregated_task_metrics.csv

appID,sqlID,description,numTasks,Duration,diskBytesSpilled_sum,...,input_bytesRead_max,...
"app-20250305192829-0000",24,"query9",353,97639,15594,...,1484282882,...

File: rapids_4_spark_profile/app-xxxx-0000/stage_level_aggregated_task_metrics.csv

stageId,numTasks,Duration,...,input_bytesRead_max,...
35,200,75095,...,1484282882,...

Since max columnar exchange bytes is a SQL plan/operator metric, it will be handled in #2020.

@parthosa parthosa marked this pull request as ready for review April 22, 2026 23:55
@hirakendu
Collaborator

Thanks @parthosa. Are we still creating a new tuning_signals.csv for metrics other than inputBytesRead_max? If so, we can consider renaming to ml_signals.csv, unless these are all gpu metrics that cannot be used for qualx model. I think we might consider adding some cpu job metrics as signals in future for qualx and tuneml-zeroshot.

Is the lowest/finest granularity possible for the new signals at app-level? It would be good to explicitly indicate the granularity for each level. By default convention, I guess it is implied to be app-level. But for the ML signals, we can have app_level_ml_signals.csv, sql_level_ml_signals.csv, stage_level_ml_signals.csv. For each signal, it's good to have it starting from the lowest/finest possible and have aggregates at the upper levels. (We may have some stage filtering or sql filtering, or define metrics only for the heaviest stage/sql, which need the low-level aggregates.)

Replaces the 2-row vertical tuning_signals.csv with a single-row wide
app_level_recommendation_signals.csv. Addresses review feedback from
@hirakendu:
- Generic naming ("recommendation signals") so future qualx / tuneml
  features can co-locate their signals here.
- Explicit granularity in the filename so per-SQL / per-stage siblings
  can be added later without ambiguity.

Columns:
  appId
  numScanStagesWithGpuOom               (profiling only, 0 for qual)
  numGpuShuffleStagesWithContainerOom   (profiling only, 0 for qual)

Comma-separated stage-ID lists are replaced by simple counts, which are
aggregation-friendly and atomic. If downstream consumers need per-stage
detail, the stage IDs are still available via failed_tasks.csv /
failed_stages.csv cross-reference (the same data this file summarises).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa
Collaborator Author

Thanks @hirakendu. Addressed them as:

  • Renamed to app_level_recommendation_signals.csv. "Recommendation signals" is
    consumer-agnostic (covers AutoTuner, qualx / tuneml-zeroshot and CPU-job signals
    later).
  • Kept the scope at app-level for this PR; the two current GPU-OOM signals are app-wide
    counts. Will add stage- and SQL-level signals later based on requirements.
  • maxTaskInputBytesRead was moved into the existing
    aggregated CSVs as input_bytesRead_max (stage / SQL / job).

Sample output for app_level_recommendation_signals.csv

appId,numScanStagesWithGpuOom,numGpuShuffleStagesWithContainerOom
"spark-99cdacbe45224a82969c20779fd245bd",3,1

…ProfileResult.build

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@hirakendu
Collaborator

LGTM, thanks for all the level-specific metrics!

@parthosa parthosa merged commit e70ce06 into NVIDIA:dev Apr 27, 2026
17 checks passed

Labels

core_tools Scope the core module (scala)

Development

Successfully merging this pull request may close: [FEA] Export AutoTuner inputs to Tools output (#2060)

4 participants