Aggregate GPU task metrics in the profiling tool by parthosa · Pull Request #2088 · NVIDIA/spark-rapids-tools

parthosa · 2026-04-27T16:45:35Z

Contributes #2020

Changes

1. New GPU task-metric aggregation CSVs at three levels

Adds long-format aggregations for the 26 GPU task accumulators emitted by the RAPIDS plugin (GpuTaskMetrics.scala). Today these are only available raw in stage_level_all_metrics.csv.

| Level | Filename | Columns |
| --- | --- | --- |
| Stage | `gpu_stage_level_aggregated_task_metrics.csv` | `stageId, numTasks, metricName, unit, sum, max, avg` |
| SQL | `gpu_sql_level_aggregated_task_metrics.csv` | `sqlId, metricName, unit, sum, max, avg` |
| App | `gpu_app_level_aggregated_task_metrics.csv` | `appId, metricName, unit, sum, max, avg` |

Note:

- The job level is skipped: each Spark action is a job, so job rows would either duplicate the SQL row or describe meaningless setup/collect jobs.
- numTasks is emitted only at the stage level, where it varies; at the SQL/app level it would be constant per row (and is already in sql_level_aggregated_task_metrics.csv).
- The CSVs are not generated when no GPU metrics are present.

Example (SQL row):

sqlId,metricName,unit,sum,max,avg
24,gpuTime,ms,86643,897,237
24,gpuMaxDeviceMemoryBytes,bytes,,10124115675,

4. Max-aggregated metrics: AccumMetaRef.METRICS_WITH_MAX_AGGREGATES extended from 4 → 9 entries. For these, sum and avg are emitted empty; only max is meaningful.
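For downstream consumers, the empty `sum` and `avg` cells are best read as "not applicable" rather than zero. The following is a minimal sketch of that, using the two rows from the SQL-level example above; the helper names (`to_optional_int`, `read_rows`) are hypothetical, and it assumes every populated cell is an integer, as in that example.

```python
import csv
import io
from typing import Optional

# Rows copied from the SQL-level example in this description.
SAMPLE = """sqlId,metricName,unit,sum,max,avg
24,gpuTime,ms,86643,897,237
24,gpuMaxDeviceMemoryBytes,bytes,,10124115675,
"""


def to_optional_int(cell: str) -> Optional[int]:
    """Map an empty CSV cell to None instead of silently treating it as 0."""
    return int(cell) if cell.strip() else None


def read_rows(text: str) -> list:
    """Parse the long-format CSV into dicts with optional sum/max/avg."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "sqlId": int(rec["sqlId"]),
            "metricName": rec["metricName"],
            "unit": rec["unit"],
            "sum": to_optional_int(rec["sum"]),
            "max": to_optional_int(rec["max"]),
            "avg": to_optional_int(rec["avg"]),
        })
    return rows


rows = read_rows(SAMPLE)
# Max-aggregated metrics are the rows whose sum and avg are both missing.
max_only = [r["metricName"] for r in rows if r["sum"] is None and r["avg"] is None]
```

Keeping `None` distinct from `0` stops a later aggregation from averaging a peak-memory metric as if it were additive.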

Testing

- AnalysisSuite: three new tests covering (1) rows produced for a GPU log plus the rollup math, (2) max-aggregated metrics carrying only max, and (3) empty output for a CPU-only log.
- E2E smoke test on core/src/test/resources/spark-events-profiling/gpu_oom_eventlog.zstd: all three CSVs are produced and the cross-level math checks out.

Emits three new long-format CSVs covering the 26 GPU task accumulators from GpuTaskMetrics.scala (gpu_stage_/sql_/app_level_aggregated_task_metrics.csv). Auto-discovery by name (gpu*, perfio.s3.*, multithreadReaderMaxParallelism); units derived from the name (Time/Wait→ms, Bytes→bytes, else count); SQL/app levels re-sum stage rows. Skips emission when no GPU metrics are present. Job level intentionally skipped (each Spark action is a job, so rows would either duplicate the SQL row or be meaningless).

Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
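The name-based discovery and unit rules in this commit message can be sketched as below. This is an illustration only, not the Scala implementation: the function names are hypothetical, and the use of prefix matching for discovery, case-sensitive substring matching for units, and the order in which the unit rules are tried are all assumptions that the commit message does not spell out.

```python
def is_gpu_metric(name: str) -> bool:
    """Assumed discovery rule: gpu* and perfio.s3.* prefixes, plus one exact name."""
    return (
        name.startswith("gpu")
        or name.startswith("perfio.s3.")
        or name == "multithreadReaderMaxParallelism"
    )


def unit_for(name: str) -> str:
    """Assumed unit rule: Time/Wait -> ms, Bytes -> bytes, anything else -> count."""
    if "Time" in name or "Wait" in name:
        return "ms"
    if "Bytes" in name:
        return "bytes"
    return "count"


# Metric names taken from this PR; the units match the SQL-level example rows.
units = {name: unit_for(name) for name in (
    "gpuTime",
    "gpuMaxDeviceMemoryBytes",
    "multithreadReaderMaxParallelism",
)}
```

A name containing both `Time` and `Bytes` would resolve to `ms` under this ordering; whether the real code behaves the same way is not stated here.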

Adds appId as the leading column on gpu_app_level_aggregated_task_metrics.csv so downstream consumers can join by application without relying on the output directory path. Also bumps the copyright year on touched files to 2026 (the pre-commit hook's sed is BSD-incompatible on macOS and silently no-ops).

Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

ArrayBuffer.flatMap returns ArrayBuffer (mutable), which no longer auto-coerces to immutable.Seq under Scala 2.13. Materialize the per-SQL row collection as Seq before passing to rollupGpuRows, and use an explicit lambda for the inner flatMap.

Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

greptile-apps · 2026-04-27T20:18:27Z

Greptile Summary

This PR adds long-format GPU task metric aggregations at three granularities (stage, SQL, app) by mining the existing GPU accumulators in app.accumManager. The rollup math — task-weighted averages, sum propagation, and max-only handling for 9 known max-aggregate metrics — is correctly implemented and well-tested.
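To make the rollup described in this summary concrete, here is a minimal Python sketch of combining stage rows into one higher-level row per metric. It is not the Scala `rollupGpuRows` code: the `StageRow` type, the `rollup` function, the single-entry `MAX_ONLY` set (the real set has 9 entries), and the choice to return no average when the task count is zero are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StageRow:
    stage_id: int
    num_tasks: int
    metric_name: str
    total: Optional[int]   # the CSV "sum" column; None for max-only metrics
    maximum: int
    avg: Optional[float]   # per-task average within the stage


# Illustrative subset: only the max-only metric shown in the PR example.
MAX_ONLY = {"gpuMaxDeviceMemoryBytes"}


def rollup(rows: Iterable[StageRow]) -> dict:
    """Combine stage rows into one {sum, max, avg} entry per metric name."""
    by_metric = {}
    for r in rows:
        by_metric.setdefault(r.metric_name, []).append(r)

    out = {}
    for name, group in by_metric.items():
        mx = max(r.maximum for r in group)
        if name in MAX_ONLY:
            # Only max is meaningful; sum and avg stay empty.
            out[name] = {"sum": None, "max": mx, "avg": None}
            continue
        total = sum(r.total for r in group)
        tasks = sum(r.num_tasks for r in group)
        # Task-weighted average: each stage's avg counts once per task.
        weighted = sum(r.avg * r.num_tasks for r in group)
        out[name] = {
            "sum": total,
            "max": mx,
            "avg": weighted / tasks if tasks > 0 else None,
        }
    return out


# Made-up stage rows, not taken from any event log.
demo = rollup([
    StageRow(1, 2, "gpuTime", 200, 150, 100.0),
    StageRow(2, 6, "gpuTime", 300, 80, 50.0),
    StageRow(1, 2, "gpuMaxDeviceMemoryBytes", None, 1024, None),
    StageRow(2, 6, "gpuMaxDeviceMemoryBytes", None, 4096, None),
])
```

With these inputs the weighted average is (100·2 + 50·6) / 8 = 62.5, whereas an unweighted mean of the two stage averages would give 75 and overstate the per-task cost.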

The two previously flagged concerns remain open: the index parameter is still unused in aggregateGpuMetricsBySql / aggregateGpuMetricsByApp, and QualRawReportGenerator still emits all three GPU labels unconditionally (unlike the nonEmpty-guarded writes in Profiler.scala).

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/consistency issues that don't affect correctness.

All logic reviewed: rolling-average usage of stats.med is confirmed correct (AccumInfo uses running-mean semantics, with a TODO to rename it); weighted-average formula in rollupGpuRows is arithmetically sound; numTasks=0 fallback now emits a warning; CSV guards in Profiler.scala are correct. Only pre-flagged P2s remain open.

QualRawReportGenerator.scala — unconditional GPU label emission differs from Profiler.scala's nonEmpty guard.
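The difference between the two output paths comes down to whether a write is guarded by a non-empty check. The following is a rough sketch of the guarded behavior, written in Python rather than the Scala of Profiler.scala; the function name `write_csv_if_non_empty` and the `app-1` identifier are hypothetical, and the row values are illustrative.

```python
import csv
import tempfile
from pathlib import Path
from typing import Sequence


def write_csv_if_non_empty(path: Path, header: Sequence[str],
                           rows: Sequence[Sequence[object]]) -> bool:
    """Write header + rows only when there is at least one row.

    Returns True if a file was written, False if the write was skipped.
    """
    if not rows:
        return False
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return True


with tempfile.TemporaryDirectory() as tmp:
    gpu_path = Path(tmp) / "gpu_app_level_aggregated_task_metrics.csv"
    header = ["appId", "metricName", "unit", "sum", "max", "avg"]

    # CPU-only case: no GPU rows, so no file should appear.
    wrote_empty = write_csv_if_non_empty(gpu_path, header, [])
    exists_after_empty = gpu_path.exists()

    # GPU case: one illustrative row.
    wrote_rows = write_csv_if_non_empty(
        gpu_path, header, [["app-1", "gpuTime", "ms", 86643, 897, 237]])
    written_text = gpu_path.read_text()
```

An unguarded writer would instead leave a header-only or empty artifact for CPU-only applications, which is the inconsistency flagged for the qualification path.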

Important Files Changed

| Filename | Overview |
| --- | --- |
| core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala | Core of the PR: adds GPU metric aggregation at stage/SQL/app levels. `stats.med` is used correctly as the rolling average (AccumInfo carries a TODO to rename it), the `rollupGpuRows` weighted-avg formula is correct, and the `numTasks=0` fallback now emits a warning. |
| core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumMetaRef.scala | Extends METRICS_WITH_MAX_AGGREGATES from 4 to 9 entries; new entries are consistent with max-aggregate semantics and match RAPIDS plugin GpuTaskMetrics. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala | Adds three new case classes (StageAggGpuMetricsProfileResult, SQLAggGpuMetricsProfileResult, AppAggGpuMetricsProfileResult) with correct Optional[Long] fields and CSV serialisation. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala | GPU labels added unconditionally to the output map, unlike the nonEmpty-guarded writes in Profiler.scala; CPU-only apps will emit empty GPU entries in the qual path. |
| core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala | Three new tests cover GPU log, max-aggregated metric semantics, and CPU-only empty output; rollup math assertions are thorough. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[app.accumManager] -->|filter isGpuMetric| B[GPU Accumulators]
    B -->|calculateAccStatsForStage| C[aggregateGpuMetricsByStage]
    C -->|StageAggGpuMetricsProfileResult| D[gpuStageRows]
    D -->|groupBy stageId| E[stageMap]
    F[app.sqlIdToStages] --> G[aggregateGpuMetricsBySql]
    E --> G
    G -->|rollupGpuRows per SQL| H[SQLAggGpuMetricsProfileResult]
    D -->|rollupGpuRows all stages| I[aggregateGpuMetricsByApp]
    I --> J[AppAggGpuMetricsProfileResult]
    D --> K[AggRawMetricsResult.gpuStageAggs]
    H --> L[AggRawMetricsResult.gpuSqlAggs]
    J --> M[AggRawMetricsResult.gpuAppAggs]
    K -->|nonEmpty guard| N[gpu_stage_level_aggregated_task_metrics.csv]
    L -->|nonEmpty guard| O[gpu_sql_level_aggregated_task_metrics.csv]
    M -->|nonEmpty guard| P[gpu_app_level_aggregated_task_metrics.csv]

Reviews (3): Last reviewed commit: "Address greptile review on PR #2088"

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

Conflicts:

- core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AggRawMetricsResult.scala
- core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAggTrait.scala
- core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationSummaryInfo.scala
- core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala
- core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala
- core/src/main/scala/com/nvidia/spark/rapids/tool/views/RawMetricProfView.scala

- Drop dead StageAggGpuMetricsProfileResult.aggregateStageProfileMetric. Stage attempts already merge upstream at the AccumInfo layer (stagesStatMap is keyed by stageId only, not stageId+attemptNumber), so a separate merge step on the case class is never invoked. Replaced the method with a comment explaining the upstream merging.
- Document the numTasks=0 invariant in aggregateGpuMetricsByStage and log a warning if the stage-task metrics cache lookup misses (which would silently distort the task-weighted avg at SQL/app level).

Fixes NVIDIA#2020
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
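The second point can be pictured as a lookup that falls back to zero but makes the miss visible. The snippet below is a hedged Python sketch, not the Scala code in aggregateGpuMetricsByStage: the function `num_tasks_for_stage`, the `task_counts` dict standing in for the stage-task metrics cache, and the logger name are all hypothetical.

```python
import logging

logger = logging.getLogger("gpu_metrics_sketch")  # hypothetical logger name


def num_tasks_for_stage(stage_id: int, task_counts: dict) -> int:
    """Return the task count for a stage, warning when the lookup misses.

    A silent 0 would give the stage zero weight in any task-weighted
    average computed at the SQL or app level, so the miss is logged.
    """
    n = task_counts.get(stage_id)
    if n is None:
        logger.warning(
            "No task metrics cached for stage %d; falling back to numTasks=0",
            stage_id,
        )
        return 0
    return n


# Made-up cache contents for illustration.
cache = {1: 4, 2: 16}
hit = num_tasks_for_stage(1, cache)
miss = num_tasks_for_stage(99, cache)
```

Logging rather than raising keeps the report generation running while still flagging that the rolled-up averages for the affected SQL or app rows may be off.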

hirakendu

LGTM, super useful!

github-actions Bot added the core_tools Scope the core module (scala) label Apr 27, 2026

parthosa self-assigned this Apr 27, 2026

parthosa and others added 2 commits April 27, 2026 09:48

parthosa requested review from amahussein, hirakendu and sayedbilalbari April 27, 2026 20:13

parthosa marked this pull request as ready for review April 27, 2026 20:13

greptile-apps Bot reviewed Apr 27, 2026

Comment thread core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala Outdated

Comment thread core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala

parthosa and others added 2 commits April 30, 2026 16:14

hirakendu approved these changes May 1, 2026

parthosa wants to merge 5 commits into NVIDIA:dev from parthosa:rapids-tools-2020

3 participants
