Export AutoTuner inputs to output and fix GPU OOM detection #2071
parthosa merged 17 commits into NVIDIA:dev
Conversation
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
…la to reflect 2026
Greptile Summary

This PR extends the profiling and qualification output with a new …

Confidence Score: 5/5. Safe to merge: no functional bugs found; only a P2 documentation mismatch between the PR description's stated CSV format (stage IDs) and the actual output (counts). All P0/P1 concerns from previous review threads were addressed (`jni.GpuOOM` anchoring, `Long` dataType in YAML, `Set[Long]` migration). The `inputBytesReadMax` accumulation is correct and consistent with the existing max-field pattern. The `getMaxInput` semantic change is equivalent. The only finding is P2: `app_level_recommendation_signals.csv` stores OOM stage counts rather than the comma-separated stage ID lists described in the PR description.

Important Files Changed: `core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala` (`AppLevelRecommendationSignalsProfileResult` stores counts vs. stage IDs as described).
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Event Log] --> B[Profiler.processApp]
    B --> C[CollectInformation]
    B --> D[AppSparkMetricsAnalyzer]
    B --> E[HealthCheck]
    C --> F[stageMetrics]
    E --> G[failedTasks / failedStages]
    D --> H[TaskMetricsAccumRec +inputBytesReadMax]
    H --> I[SQLTaskAggMetricsProfileResult +inputBytesReadMax]
    H --> J[StageAggTaskMetricsProfileResult +inputBytesReadMax]
    H --> K[JobAggTaskMetricsProfileResult +inputBytesReadMax]
    F --> L[computeScanStagesWithGpuOom]
    G --> L
    F --> M[computeShuffleStagesWithContainerOom]
    G --> M
    L --> N[AppLevelRecommendationSignals stores counts only]
    M --> N
    N --> O[app_level_recommendation_signals.csv]
    L --> P[scanStagesWithGpuOom Set]
    M --> Q[gpuShuffleStagesWithContainerOom Set]
    P --> R[AutoTuner .nonEmpty checks]
    Q --> R
    I --> S[getMaxInput = sqlAggs.map.inputBytesReadMax.max]
    S --> R
```
Reviews (8): last reviewed commit "Fix scalastyle line-length violation in ..."
- Changed data type of `maxTaskInputBytesRead` from Double to Long in `coreRawMetricsReport.yaml`.
- Refactored methods in `ApplicationSummaryInfo.scala` to read pre-computed values for GPU OOM error handling, enhancing performance and reducing duplicate computations.
- Updated comments in `ProfileClassWarehouse.scala` for clarity on GPU OOM exception handling.

These changes aim to improve the accuracy and efficiency of profiling metrics related to GPU memory management.
Aligned with @amahussein offline, will create a separate file for these.
Move 4 AutoTuner input fields (`maxTaskInputBytesRead`, `maxColumnarExchangeDataSizeBytes`, `scanStagesWithGpuOom`, `shuffleStagesWithOom`) from `application_information.csv` into a new `app_tuning_metrics.csv` file. These are tuning-specific signals that don't belong in the core application info table.

Also addresses review feedback:
- Anchor GpuOOM pattern as `jni.GpuOOM` to avoid partial matches
- Change `maxTaskInputBytesRead` YAML dataType from Double to Long

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
…to 0

- Rename file from `app_tuning_metrics.csv` to `application_tuning_metrics.csv` (constant `APP_TUNING_METRICS` -> `APPLICATION_TUNING_METRICS`).
- Change `maxTaskInputBytesRead` and `maxColumnarExchangeDataSizeBytes` in `AppTuningMetricsProfileResult` to Long with default 0, so missing values render as "0" symmetrically instead of mixing empty strings and numbers.
- Add `application_tuning_metrics.csv` to the per-app output tree comment in `coreRawMetricsReport.yaml`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
…ReportGenerator.scala to enhance code clarity and maintainability.
Keep the original Scallop default (false) to avoid changing CLI behavior. Qualification always writes CSVs via ProfileOutputWriter(outputCSV = true) so application_tuning_metrics.csv is still produced there unconditionally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
File is unchanged functionally vs dev; skip the auto-copyrighter hook to keep the copyright year at 2025 since this PR does not modify the file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
- Rename output file `application_tuning_metrics.csv` -> `tuning_signals.csv` and switch to vertical `metricName,value` layout (matches `spark_properties.csv`). `appId` is dropped since the file lives in a per-app directory.
- Replace `AppTuningMetricsProfileResult` with a row-shaped `TuningSignalProfileResult` plus a builder that emits one row per metric. Simplifies adding new signals.
- Rename trait method `shuffleStagesWithOom` -> `gpuShuffleStagesWithContainerOom` to make the GPU-only gating explicit and distinguish container-level (YARN SIGKILL) OOM from device-level GPU OOM.
- Drop the Gpu prefix on the static compute helper (`computeShuffleStagesWithContainerOom`) since it lives in the profiling-only `SingleAppSummaryInfoProvider` companion.
- Update YAML schema, `OutHeaderRegistry`, `Profiler`/`QualRawReportGenerator` call sites, AutoTuner consumer, and test mocks accordingly.
- ToolsAPI auto-discovers the new `coreRawTuningSignalsCSV` label via the YAML; no Python changes needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
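The row-shaped result plus builder described above can be sketched as follows. This is a minimal sketch with illustrative names; the real `TuningSignalProfileResult` has different fields and sits inside the tools' reporting framework.

```scala
// Illustrative stand-in for a row-per-metric profile result: each signal
// becomes one vertical (name, value) row, so adding a new signal is just
// one more entry in the input sequence.
case class TuningSignalRow(name: String, value: String)

object TuningSignalRow {
  // Builder that emits one row per metric, mirroring the commit's description.
  def build(signals: Seq[(String, Any)]): Seq[TuningSignalRow] =
    signals.map { case (n, v) => TuningSignalRow(n, v.toString) }
}
```

A wide single-row CSV would instead need a schema change for every new signal; the vertical layout trades that for a fixed two-column header.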
Shorter header matches the spark_properties.csv (propertyName, propertyValue) pattern but with 'name' since the containing CSV already conveys the 'metric/signal' context via its filename. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
amahussein
left a comment
LGTM!
Nice feature to add for the tools output.
…s.csv

- Add `input_bytesRead_max` column to stage/SQL/job aggregated task metrics CSVs alongside existing `_max` columns (`duration_max`, `peakExecutionMemory_max`, `resultSize_max`). Tracked in `TaskMetricsAccumRec` and flows through `StageAggAccum` / `SQLAggAccum` / `JobAggAccum` to the three result case classes.
- Drop `maxTaskInputBytesRead` and `maxColumnarExchangeDataSizeBytes` rows from `tuning_signals.csv`. The file now contains only the two GPU OOM signals.
- Remove the `SQLMaxTaskInputSizes` case class, the `maxTaskInputSizeBytesPerSQL()` helper, and the `maxTaskInputSizes` field from `AggRawMetricsResult` and `ApplicationSummaryInfo`; consumers (`getMaxInput` on both providers) now derive the value from `sqlTaskAggMetrics`/`sqlAggs` via `inputBytesReadMax`.
- AutoTuner's AQE ColumnarExchange adjustment is unchanged; it reads from in-memory `sqlMetrics` via `SingleAppSummaryInfoProvider`, not from CSV.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
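The single-pass accumulation described above can be sketched as follows. This is an assumed shape, not the real `TaskMetricsAccumRec`: the point is only that the `_max` field rides along in the same per-task loop that already produces the `_sum` aggregates, so no second pass over tasks is needed.

```scala
// Minimal sketch of accumulating a _max field alongside a _sum field in one
// pass over task metrics. Field names mirror the CSV columns, not real code.
class InputBytesAccum {
  var inputBytesReadSum: Long = 0L
  var inputBytesReadMax: Long = 0L

  def addTask(inputBytesRead: Long): Unit = {
    inputBytesReadSum += inputBytesRead                             // existing _sum path
    inputBytesReadMax = math.max(inputBytesReadMax, inputBytesRead) // new _max path
  }
}
```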
Updated the 9 expectation CSVs under ProfilingExpectations/ to include the new inputBytesReadMax column produced by the stage/SQL/job task metric aggregates. Values were regenerated by running AnalysisSuite against the existing event logs and writing the actual DataFrames back to the expectation files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Files touched by the tuning-signals relocation change had outdated copyright headers that the CI header check flagged as expired. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Based on offline discussion with @hirakendu, moved input bytes read to sql/stage/job level aggregated task metrics.

File:

```
appID,sqlID,description,numTasks,Duration,diskBytesSpilled_sum,...,input_bytesRead_max,...
"app-20250305192829-0000",24,"query9",353,97639,15594,...,1484282882,...
```

File:

```
stageId,numTasks,Duration,...,input_bytesRead_max,...
35,200,75095,...,1484282882,...
```

Since max columnar exchange bytes is a sql plan/operator metric, will handle that in #2020.
Thanks @parthosa. Are we still creating a new … Is the lowest/finest granularity possible for the new signals at app-level? It would be good to explicitly indicate the granularity for each level. By default convention, I guess it is implied to be app-level. But for the ML signals, we can have …
Replaces the 2-row vertical `tuning_signals.csv` with a single-row wide `app_level_recommendation_signals.csv`. Addresses review feedback from @hirakendu:

- Generic naming ("recommendation signals") so future qualx / tuneml features can co-locate their signals here.
- Explicit granularity in the filename so per-SQL / per-stage siblings can be added later without ambiguity.

Columns:
- `appId`
- `numScanStagesWithGpuOom` (profiling only, 0 for qual)
- `numGpuShuffleStagesWithContainerOom` (profiling only, 0 for qual)

Comma-separated stage-ID lists are replaced by simple counts, which are aggregation-friendly and atomic. If downstream consumers need per-stage detail, the stage IDs are still available via `failed_tasks.csv` / `failed_stages.csv` cross-reference (the same data this file summarises).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Thanks @hirakendu. Addressed them as:
Sample output for …:

```
appId,numScanStagesWithGpuOom,numGpuShuffleStagesWithContainerOom
"spark-99cdacbe45224a82969c20779fd245bd",3,1
```
…ProfileResult.build Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
LGTM, thanks for all the level-specific metrics!
Fixes #2060
Changes
1. New `app_level_recommendation_signals.csv`

Per-app output (profiling and qualification) with a single wide row of derived signals that feed recommendation engines (AutoTuner, qualx, etc.). GPU-only signals are `0` for qualification (CPU event logs).

Columns: `appId`, `numScanStagesWithGpuOom`, `numGpuShuffleStagesWithContainerOom`

Example output (profiling on a GPU event log):
Both OOM counts require cross-referencing `failed_tasks.csv` × `failed_stages.csv` × scan-time/`GpuShuffleExchange` markers, and are not derivable from any single existing CSV, which is why they live in a dedicated file. Layout is wide (single row) to match the project convention for per-app CSVs; per-SQL and per-stage breakdowns are left as a follow-up if/when additional signals are added.

2. Add `input_bytesRead_max` to aggregated task-metrics CSVs

Based on offline discussion with @hirakendu, moved input bytes read to sql/stage/job level aggregated task metrics.

New `input_bytesRead_max` column appended to `stage_level_aggregated_task_metrics.csv`, `sql_level_aggregated_task_metrics.csv`, and `job_level_aggregated_task_metrics.csv`, alongside the existing `_max` columns (`duration_max`, `peakExecutionMemory_max`, `resultSize_max`). Computed inside `TaskMetricsAccumRec` in the same pass as the `_sum` aggregates.

File: `rapids_4_spark_profile/app-xxxx-0000/sql_level_aggregated_task_metrics.csv`

File: `rapids_4_spark_profile/app-xxxx-0000/stage_level_aggregated_task_metrics.csv`

AutoTuner's `getMaxInput` now derives the value from `sqlTaskAggMetrics.map(_.inputBytesReadMax).max` instead of the dedicated `SQLMaxTaskInputSizes` record, which has been deleted along with `maxTaskInputSizeBytesPerSQL()` and the corresponding fields on `AggRawMetricsResult`/`ApplicationSummaryInfo`.

Since max columnar exchange bytes is a sql plan/operator metric, will handle that in #2020.
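The `getMaxInput` derivation above can be sketched as follows; `SqlAgg` is a simplified stand-in for the real SQL-level aggregate result, and only the `inputBytesReadMax` field matters here.

```scala
// Simplified stand-in for the SQL-level aggregated task metrics result.
case class SqlAgg(sqlId: Long, inputBytesReadMax: Long)

// Max over the per-SQL maxima. reduceOption guards the no-SQL case, which a
// bare .max on an empty Seq would throw on.
def getMaxInput(sqlAggs: Seq[SqlAgg]): Long =
  sqlAggs.map(_.inputBytesReadMax).reduceOption(_ max _).getOrElse(0L)
```

This is semantically equivalent to the deleted `SQLMaxTaskInputSizes` path because both take the maximum of the same per-SQL task maxima.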
3. `shuffle_skew_check.csv` for qualification

Qualification computes `taskShuffleSkew` in `rawAggMetrics` but never wrote it. Now emitted under the same schema profiling already uses; `aggRawMetrics` is computed once and reused.

4. Fix GPU OOM detection
Previous detection missed pre-24.02 JNI logs that use `com.nvidia.spark.rapids.jni.SplitAndRetryOOM` (no `Gpu` prefix). `SparkRapidsOomExceptions.isGpuOom()` now matches: the `jni.` prefix avoids `CpuSplitAndRetryOOM`/`CpuRetryOOM`; the `jni.GpuOOM` anchor catches the base class without matching arbitrary `GpuOOM*` substrings.
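The anchoring logic can be sketched as substring checks over a task's failure reason. The marker list and method shape below are illustrative, not the actual `SparkRapidsOomExceptions` code; the point is that the `jni.` package anchor keeps `Cpu*OOM` classes from matching.

```scala
// Illustrative markers anchored at the jni. package segment, so CPU-side
// classes like jni.CpuSplitAndRetryOOM never match (after "jni." comes "Cpu").
val gpuOomMarkers: Seq[String] = Seq(
  "jni.GpuOOM",           // base class, anchored
  "jni.GpuRetryOOM",
  "jni.GpuSplitAndRetryOOM",
  "jni.SplitAndRetryOOM"  // pre-24.02 class without the Gpu prefix
)

def isGpuOom(failureReason: String): Boolean =
  gpuOomMarkers.exists(failureReason.contains)
```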
5. `AppInfoGpuOomCheck` trait: `Boolean` → `Set[Long]`

`hasScanStagesWithGpuOom: Boolean` → `scanStagesWithGpuOom: Set[Long]` and `hasShuffleStagesWithOom: Boolean` → `gpuShuffleStagesWithContainerOom: Set[Long]`. AutoTuner uses `.nonEmpty`; the CSV lists the actual stage IDs. The `gpuShuffle…ContainerOom` naming makes the GPU-only gating explicit and distinguishes container-level (YARN SIGKILL) OOM from device-level GPU OOM. Shared `compute*` helpers moved to the `SingleAppSummaryInfoProvider` companion for reuse.
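The `Boolean` → `Set[Long]` shape can be sketched as below; the trait members follow the PR text, while the consuming check is illustrative.

```scala
// Trait members per the PR: stage IDs instead of a lossy Boolean, so one
// source of truth serves both the AutoTuner gate and the CSV detail.
trait AppInfoGpuOomCheck {
  def scanStagesWithGpuOom: Set[Long]
  def gpuShuffleStagesWithContainerOom: Set[Long]
}

// AutoTuner-style consumer: only presence matters, via .nonEmpty.
def hasAnyGpuOomSignal(info: AppInfoGpuOomCheck): Boolean =
  info.scanStagesWithGpuOom.nonEmpty || info.gpuShuffleStagesWithContainerOom.nonEmpty
```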
Not changing

`maxColumnarExchangeDataSizeBytes` is no longer exposed as a CSV signal, but the in-memory provider method and AutoTuner's AQE partition adjustment (`AutoTuner.scala`) are unchanged; the AutoTuner reads it from in-memory `sqlMetrics` directly.

Testing
- `OomDetectionSuite`: table-driven tests covering current JNI classes, pre-24.02 classes, CPU OOM exclusion, and non-OOM failures.
- `BaseAutoTunerSuite` mock updated to `Set[Long]` and the renamed field.
- `ProfilingAutoTunerSuite`/`V2` call sites migrated from `Boolean` to `Set[Long]`.
- `ApplicationInfoSuite` expected CSV file counts bumped by 1 for the new `app_level_recommendation_signals.csv`.