Aggregate GPU task metrics in the profiling tool #2088

Open
parthosa wants to merge 5 commits into NVIDIA:dev from parthosa:rapids-tools-2020
@parthosa parthosa commented Apr 27, 2026

Contributes #2020

Changes

1. New GPU task-metric aggregation CSVs at three levels

Adds long-format aggregations for the 26 GPU task accumulators emitted by the RAPIDS plugin (GpuTaskMetrics.scala). Today these are only available raw in stage_level_all_metrics.csv.

| Level | Filename | Columns |
|-------|----------|---------|
| Stage | gpu_stage_level_aggregated_task_metrics.csv | stageId, numTasks, metricName, unit, sum, max, avg |
| SQL | gpu_sql_level_aggregated_task_metrics.csv | sqlId, metricName, unit, sum, max, avg |
| App | gpu_app_level_aggregated_task_metrics.csv | appId, metricName, unit, sum, max, avg |

Note:

  • Job level skipped (each Spark action is a job — rows would either duplicate the SQL row or be meaningless setup/collect jobs).
  • numTasks only at stage level, where it varies; at SQL/app it would be a constant per row (already in sql_level_aggregated_task_metrics.csv).
  • CSVs not generated when no GPU metrics are present.

Example (SQL row):

sqlId,metricName,unit,sum,max,avg
24,gpuTime,ms,86643,897,237
24,gpuMaxDeviceMemoryBytes,bytes,,10124115675,

4. Max-aggregated metrics: AccumMetaRef.METRICS_WITH_MAX_AGGREGATES extended from 4 → 9 entries. For these, sum and avg are emitted empty; only max is meaningful.
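
To illustrate the empty `sum`/`avg` cells, here is a minimal sketch of how a max-only row could serialise. The names `GpuSqlRowSketch`, `GpuRowBuilder`, and `maxOnlyMetrics` are hypothetical, and the set below holds only the one metric shown in the example above rather than the full list in `AccumMetaRef.METRICS_WITH_MAX_AGGREGATES`.

```scala
// Hypothetical row type for illustration; the PR's real case classes
// live in ProfileClassWarehouse.scala and may differ.
final case class GpuSqlRowSketch(
    sqlId: Long,
    metricName: String,
    unit: String,
    sum: Option[Long],
    max: Long,
    avg: Option[Long]) {
  // A missing Option renders as an empty CSV cell.
  def toCsvLine: String =
    Seq(
      sqlId.toString,
      metricName,
      unit,
      sum.map(_.toString).getOrElse(""),
      max.toString,
      avg.map(_.toString).getOrElse("")).mkString(",")
}

object GpuRowBuilder {
  // Assumed subset: only the metric visible in the PR's example row.
  val maxOnlyMetrics: Set[String] = Set("gpuMaxDeviceMemoryBytes")

  // Drops sum and avg for max-only metrics, keeps them otherwise.
  def build(sqlId: Long, name: String, unit: String,
      sum: Long, max: Long, avg: Long): GpuSqlRowSketch =
    if (maxOnlyMetrics.contains(name)) {
      GpuSqlRowSketch(sqlId, name, unit, None, max, None)
    } else {
      GpuSqlRowSketch(sqlId, name, unit, Some(sum), max, Some(avg))
    }
}
```

Under these assumptions, building the two rows from the example reproduces the CSV lines shown above, including the trailing comma on the max-only row.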

Testing

  • AnalysisSuite — three new tests: rows produced for GPU log + rollup math; max-aggregated metrics carry only max; empty for CPU-only log.
  • E2E smoke on core/src/test/resources/spark-events-profiling/gpu_oom_eventlog.zstd: all three CSVs produced; cross-level math verifies.

Emits three new long-format CSVs covering the 26 GPU task accumulators
from GpuTaskMetrics.scala (gpu_stage_/sql_/app_level_aggregated_task_metrics.csv).
Auto-discovery by name (gpu*, perfio.s3.*, multithreadReaderMaxParallelism);
units derived from the name (Time/Wait→ms, Bytes→bytes, else count); SQL/app
levels re-sum stage rows. Skips emission when no GPU metrics are present.
Job level intentionally skipped (each Spark action is a job — would either
duplicate the SQL row or be meaningless).

Fixes NVIDIA#2020

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
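
The name-based discovery and unit derivation described in this commit message can be sketched roughly as follows. This is an illustration only: `GpuMetricNamingSketch`, `isGpuMetric`, and `unitFor` are hypothetical names, and the matching rules (prefix match for discovery, substring match for units, `Time`/`Wait` checked before `Bytes`) are assumptions rather than the PR's actual logic.

```scala
object GpuMetricNamingSketch {
  // Name patterns come from the commit message; how they are matched is assumed.
  def isGpuMetric(name: String): Boolean =
    name.startsWith("gpu") ||
      name.startsWith("perfio.s3.") ||
      name == "multithreadReaderMaxParallelism"

  // Assumption: plain substring checks, with Time/Wait taking priority over Bytes.
  def unitFor(name: String): String =
    if (name.contains("Time") || name.contains("Wait")) "ms"
    else if (name.contains("Bytes")) "bytes"
    else "count"
}
```

With these rules, `gpuTime` maps to `ms` and `gpuMaxDeviceMemoryBytes` to `bytes`, which agrees with the example rows in the PR description.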
@github-actions github-actions Bot added the core_tools Scope the core module (scala) label Apr 27, 2026
@parthosa parthosa self-assigned this Apr 27, 2026
parthosa and others added 2 commits April 27, 2026 09:48
Adds appId as the leading column on gpu_app_level_aggregated_task_metrics.csv
so downstream consumers can join by application without relying on the output
directory path. Also bumps the copyright year on touched files to 2026 (the
pre-commit hook's sed is BSD-incompatible on macOS and silently no-ops).

Fixes NVIDIA#2020

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
ArrayBuffer.flatMap returns ArrayBuffer (mutable), which no longer
auto-coerces to immutable.Seq under Scala 2.13. Materialize the per-SQL
row collection as Seq before passing to rollupGpuRows, and use an explicit
lambda for the inner flatMap.

Fixes NVIDIA#2020

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
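
For context on the Scala 2.13 fix, a minimal sketch of the pattern follows. `SeqCoercionSketch`, `rollup`, and `perSqlTotal` are hypothetical stand-ins (the real code passes GPU rows to `rollupGpuRows`); the point is only that calling `.toSeq` on the `flatMap` result gives a value that type-checks as `Seq` on both 2.12 and 2.13.

```scala
import scala.collection.mutable.ArrayBuffer

object SeqCoercionSketch {
  // Hypothetical stand-in for rollupGpuRows. The parameter is the default
  // scala.Seq, which is immutable.Seq under Scala 2.13.
  def rollup(rows: Seq[Int]): Int = rows.sum

  def perSqlTotal(stageIds: ArrayBuffer[Int], rowsByStage: Map[Int, Seq[Int]]): Int = {
    // flatMap on an ArrayBuffer yields an ArrayBuffer; the explicit lambda and
    // the trailing .toSeq materialise it as a Seq before the call.
    val rows: Seq[Int] =
      stageIds.flatMap(id => rowsByStage.getOrElse(id, Seq.empty[Int])).toSeq
    rollup(rows)
  }
}
```

Stage ids with no entry in the map simply contribute no rows.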
@parthosa parthosa marked this pull request as ready for review April 27, 2026 20:13
greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR adds long-format GPU task metric aggregations at three granularities (stage, SQL, app) by mining the existing GPU accumulators in app.accumManager. The rollup math — task-weighted averages, sum propagation, and max-only handling for 9 known max-aggregate metrics — is correctly implemented and well-tested.
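
The rollup math summarised above can be sketched as follows. This is not the PR's `rollupGpuRows`; `RollupSketch`, `StageRow`, and `Rolled` are hypothetical, integer division is assumed for the average, and the fallback used when no task counts are known is a guess.

```scala
object RollupSketch {
  // Hypothetical stage-level row for a single metric.
  final case class StageRow(stageId: Int, numTasks: Int, sum: Long, max: Long, avg: Long)
  final case class Rolled(sum: Long, max: Long, avg: Long)

  def rollup(rows: Seq[StageRow]): Option[Rolled] =
    if (rows.isEmpty) {
      None
    } else {
      val totalTasks = rows.map(_.numTasks.toLong).sum
      // Each stage average is weighted by the number of tasks in that stage.
      val weighted = rows.map(r => r.avg * r.numTasks).sum
      // Assumed fallback: unweighted mean of stage averages when no task counts exist.
      val avg =
        if (totalTasks > 0) weighted / totalTasks
        else rows.map(_.avg).sum / rows.size
      Some(Rolled(rows.map(_.sum).sum, rows.map(_.max).max, avg))
    }
}
```

For two stages with 10 tasks averaging 100 and 30 tasks averaging 20, the weighted average is (10 × 100 + 30 × 20) / 40 = 40, whereas a plain mean of the two stage averages would give 60.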

The two previously flagged concerns remain open: the index parameter is still unused in aggregateGpuMetricsBySql / aggregateGpuMetricsByApp, and QualRawReportGenerator still emits all three GPU labels unconditionally (unlike the nonEmpty-guarded writes in Profiler.scala).

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/consistency issues that don't affect correctness.

All logic reviewed: rolling-average usage of stats.med is confirmed correct (AccumInfo uses running-mean semantics, with a TODO to rename it); weighted-average formula in rollupGpuRows is arithmetically sound; numTasks=0 fallback now emits a warning; CSV guards in Profiler.scala are correct. Only pre-flagged P2s remain open.

QualRawReportGenerator.scala — unconditional GPU label emission differs from Profiler.scala's nonEmpty guard.
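
One way the qual path could mirror the profiler's guard is to drop empty tables before they reach the output map. The helper below is a hypothetical sketch, not code from `QualRawReportGenerator.scala` or `Profiler.scala`; the label strings and row types there are not shown in this review.

```scala
object GuardedWriteSketch {
  // Hypothetical helper: keeps only the labels whose row collections are non-empty,
  // so a CPU-only app would contribute no GPU entries.
  def nonEmptyTables[T](tables: Seq[(String, Seq[T])]): Map[String, Seq[T]] =
    tables.filter { case (_, rows) => rows.nonEmpty }.toMap
}
```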

Important Files Changed

| Filename | Overview |
|----------|----------|
| core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala | Core of the PR: adds GPU metric aggregation at stage/SQL/app levels. stats.med is the rolling average (correctly named per AccumInfo TODO), rollupGpuRows weighted-avg formula is correct, and numTasks=0 fallback now emits a warning. |
| core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumMetaRef.scala | Extends METRICS_WITH_MAX_AGGREGATES from 4 to 9 entries; new entries are consistent with max-aggregate semantics and match RAPIDS plugin GpuTaskMetrics. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala | Adds three new case classes (StageAggGpuMetricsProfileResult, SQLAggGpuMetricsProfileResult, AppAggGpuMetricsProfileResult) with correct Optional[Long] fields and CSV serialisation. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala | GPU labels added unconditionally to the output map, unlike the nonEmpty-guarded writes in Profiler.scala; CPU-only apps will emit empty GPU entries in the qual path. |
| core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala | Three new tests cover GPU log, max-aggregated metric semantics, and CPU-only empty output; rollup math assertions are thorough. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[app.accumManager] -->|filter isGpuMetric| B[GPU Accumulators]
    B -->|calculateAccStatsForStage| C[aggregateGpuMetricsByStage]
    C -->|StageAggGpuMetricsProfileResult| D[gpuStageRows]
    D -->|groupBy stageId| E[stageMap]
    F[app.sqlIdToStages] --> G[aggregateGpuMetricsBySql]
    E --> G
    G -->|rollupGpuRows per SQL| H[SQLAggGpuMetricsProfileResult]
    D -->|rollupGpuRows all stages| I[aggregateGpuMetricsByApp]
    I --> J[AppAggGpuMetricsProfileResult]
    D --> K[AggRawMetricsResult.gpuStageAggs]
    H --> L[AggRawMetricsResult.gpuSqlAggs]
    J --> M[AggRawMetricsResult.gpuAppAggs]
    K -->|nonEmpty guard| N[gpu_stage_level_aggregated_task_metrics.csv]
    L -->|nonEmpty guard| O[gpu_sql_level_aggregated_task_metrics.csv]
    M -->|nonEmpty guard| P[gpu_app_level_aggregated_task_metrics.csv]

Reviews (3): Last reviewed commit: "Address greptile review on PR #2088"

parthosa and others added 2 commits April 30, 2026 16:14
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

# Conflicts:
#	core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AggRawMetricsResult.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAggTrait.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ApplicationSummaryInfo.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/Profiler.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/views/QualRawReportGenerator.scala
#	core/src/main/scala/com/nvidia/spark/rapids/tool/views/RawMetricProfView.scala
- Drop dead StageAggGpuMetricsProfileResult.aggregateStageProfileMetric.
  Stage attempts already merge upstream at the AccumInfo layer
  (stagesStatMap is keyed by stageId only, not stageId+attemptNumber),
  so a separate merge step on the case class is never invoked. Replaced
  the method with a comment explaining the upstream merging.
- Document the numTasks=0 invariant in aggregateGpuMetricsByStage and
  log a warning if the stage-task metrics cache lookup misses (which
  would silently distort the task-weighted avg at SQL/app level).

Fixes NVIDIA#2020

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
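
The warn-on-miss behaviour described in the second bullet might look roughly like the sketch below. `NumTasksLookupSketch`, `numTasksFor`, the cache shape, and the message text are all hypothetical; the real lookup in `aggregateGpuMetricsByStage` is not shown in this conversation.

```scala
object NumTasksLookupSketch {
  // Hypothetical cache shape: stageId -> number of tasks.
  // getOrElse takes its default by name, so warn runs only on a cache miss.
  def numTasksFor(stageId: Int, cache: Map[Int, Int], warn: String => Unit): Int =
    cache.getOrElse(stageId, {
      warn(s"No task metrics cached for stage $stageId; using numTasks=0")
      0
    })
}
```

Passing the logger as a function keeps the sketch free of any logging dependency; a real implementation would presumably call the class's own logger.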
@hirakendu hirakendu left a comment

LGTM, super useful!
