Drop AutoTuner recommendation for concurrentGpuTasks for apps using plugin >= 25.06 #2090

Open
parthosa wants to merge 1 commit into NVIDIA:dev from parthosa:rapids-tools-2089

Conversation

@parthosa
Collaborator

Fixes #2089

Changes

Starting with plugin 25.06, the RAPIDS plugin auto-tunes the number of concurrent GPU tasks based on memory usage (NVIDIA/spark-rapids#12374). The AutoTuner should no longer recommend spark.rapids.sql.concurrentGpuTasks for apps using that plugin version or later.

Logic (in AutoTuner.calculateClusterLevelRecommendations):

  • Extract the unique plugin jar version from appInfoProvider.getRapidsJars using the existing pluginJarRegEx.
  • If version >= 25.06.0, skip the recommendation by adding the key to skippedRecommendations. This also suppresses the "was not set" missing comment.
  • Target cluster enforced and preserve overrides take precedence: the recommendation is still emitted in those cases.
  • If the plugin jar version cannot be determined (no jars or multiple distinct versions), fall back to the existing recommendation logic.
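The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `ConcurrentGpuTasksSketch`, `PluginJarPattern`, `uniquePluginVersion`, `atLeast`, and `shouldSkip` are hypothetical names, and the regex is only a stand-in for the existing `pluginJarRegEx`, whose exact pattern is not shown here.

```scala
object ConcurrentGpuTasksSketch {
  // Hypothetical stand-in for the tool's pluginJarRegEx; the real pattern may differ.
  private val PluginJarPattern =
    """rapids-4-spark_\d\.\d+-(\d{2}\.\d{2}\.\d+).*\.jar""".r

  // Returns the plugin version only when exactly one distinct version is found.
  // No jars, or several distinct versions, yield None (fall back to old logic).
  def uniquePluginVersion(jars: Seq[String]): Option[String] = {
    val versions = jars.flatMap { jar =>
      PluginJarPattern.findFirstMatchIn(jar).map(_.group(1))
    }.distinct
    if (versions.size == 1) versions.headOption else None
  }

  // Numeric, component-wise comparison of dotted versions such as "25.06.0".
  def atLeast(version: String, threshold: String): Boolean = {
    val a = version.split('.').toSeq.map(_.toInt)
    val b = threshold.split('.').toSeq.map(_.toInt)
    // The first differing component decides; equal versions count as "at least".
    a.zipAll(b, 0, 0).find { case (x, y) => x != y }.forall { case (x, y) => x > y }
  }

  // Skip only when the user has not overridden the property and the
  // plugin version is unambiguous and >= 25.06.0.
  def shouldSkip(jars: Seq[String], userOverridden: Boolean): Boolean =
    !userOverridden && uniquePluginVersion(jars).exists(atLeast(_, "25.06.0"))
}
```

In this sketch, `shouldSkip` returning `true` corresponds to adding the key to `skippedRecommendations`; any other outcome falls through to the existing recommendation logic.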

New helper: Platform.isPropertyUserOverridden(key) — returns true if the property is in enforced or preserve. Lives next to getUserEnforcedSparkProperty / isPropertyPreserved / isPropertyExcluded.
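For illustration, the helper can be thought of as a simple composition of the two existing checks. The sketch below uses a hypothetical `TargetClusterOverrides` case class in place of `Platform`, and the method signatures are assumptions rather than the actual ones in the codebase.

```scala
// Hypothetical, simplified model of the target-cluster overrides held by Platform.
final case class TargetClusterOverrides(
    enforced: Map[String, String] = Map.empty,
    preserve: Set[String] = Set.empty) {

  // Assumed shape: the enforced value for the key, if the user supplied one.
  def getUserEnforcedSparkProperty(key: String): Option[String] = enforced.get(key)

  // Assumed shape: whether the user asked to keep the app's existing value.
  def isPropertyPreserved(key: String): Boolean = preserve.contains(key)

  // True when the property appears in either enforced or preserve.
  def isPropertyUserOverridden(key: String): Boolean =
    getUserEnforcedSparkProperty(key).isDefined || isPropertyPreserved(key)
}
```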

Cleanup: dropped an unnecessary .toInt on calcGpuConcTasks(), since appendRecommendation already has a Long overload.

Testing

ProfilingAutoTunerSuite — 4 new tests:

  • Drops spark.rapids.sql.concurrentGpuTasks for plugin 25.06.0
  • Keeps recommendation for plugin 25.04.0
  • Keeps recommendation when no plugin jar found
  • Target cluster enforced value wins over the drop logic

Full core test suite (669 tests across 29 suites) passes.

@parthosa parthosa self-assigned this Apr 30, 2026
@github-actions github-actions Bot added the core_tools Scope the core module (scala) label Apr 30, 2026
Starting with plugin 25.06, the RAPIDS plugin auto-tunes the number of
concurrent GPU tasks based on memory usage (NVIDIA/spark-rapids#12374), so
the AutoTuner should stop recommending `spark.rapids.sql.concurrentGpuTasks`
for apps using that plugin version or later. Target cluster `enforced` and
`preserve` overrides still take precedence.

Fixes NVIDIA#2089

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa force-pushed the rapids-tools-2089 branch from 03ebd43 to 1286dcc Compare April 30, 2026 20:17
@parthosa parthosa marked this pull request as ready for review April 30, 2026 20:35
@greptile-apps

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR stops the AutoTuner from recommending spark.rapids.sql.concurrentGpuTasks when the profiled application uses a RAPIDS plugin version ≥ 25.06, since that version already auto-tunes the property at runtime. The change is well-scoped: a new isPropertyUserOverridden helper on Platform gates the skip behind enforced/preserve overrides, getRapidsPluginJarVersion de-duplicates jar entries before comparing, and the version threshold is a named constant.

Confidence Score: 4/5

Safe to merge; all findings are P2 suggestions with no impact on the main happy path.

The core logic is correct and well-tested for the primary cases. Two P2 observations: compareVersions returning 0 on failure could theoretically cause an incorrect skip (extremely unlikely with a hardcoded constant), and the preserve branch of isPropertyUserOverridden lacks a dedicated test.

No files require special attention; both findings are minor improvements.

Important Files Changed

| Filename | Overview |
| --- | --- |
| core/src/main/scala/com/nvidia/spark/rapids/tool/tuning/AutoTuner.scala | Adds getRapidsPluginJarVersion and isConcurrentGpuTasksAutoTunedByPlugin helpers, and gates the concurrentGpuTasks recommendation on the plugin version; logic is sound with one minor edge case around compareVersions returning 0 on failure. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/Platform.scala | Adds isPropertyUserOverridden as a clean composition of the two existing helpers; no issues found. |
| core/src/test/scala/com/nvidia/spark/rapids/tool/tuning/ProfilingAutoTunerSuite.scala | Four new scenario tests cover the main cases; the preserve branch of isPropertyUserOverridden is not tested despite the helper supporting it. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[calculateClusterLevelRecommendations] --> B{isPropertyUserOverridden\nconcurrentGpuTasks?}
    B -- enforced or preserve --> C[appendRecommendation\nwith enforced/calculated value]
    B -- no override --> D{isConcurrentGpuTasksAutoTunedByPlugin?}
    D --> E[getRapidsPluginJarVersion]
    E --> F{Unique version\nextracted?}
    F -- None / multiple distinct --> G[return false]
    F -- Some version --> H{compareVersions >= 25.06.0?}
    H -- yes --> I[return true]
    H -- no --> G
    G --> C
    I --> J[skippedRecommendations += concurrentGpuTasks\nno recommendation emitted]

Reviews (1): Last reviewed commit: "Drop concurrentGpuTasks recommendation f..."

Comment on lines +1667 to 1672
enforcedProps = Map("spark.rapids.sql.concurrentGpuTasks" -> "4"))
assert(output.contains("spark.rapids.sql.concurrentGpuTasks=4"),
s"Expected enforced concurrentGpuTasks=4 to be present, got:\n$output")
}

test("No recommendation when the jar pluginJar is up-to-date") {

P2 Missing test for preserve override path

isPropertyUserOverridden has two branches — getUserEnforcedSparkProperty (enforced) and isPropertyPreserved (preserve) — but only the enforced branch is exercised by the new tests. The runConcurrentGpuTasksScenario helper already accepts preserveProps, so a fifth test can close this gap:

test("Target cluster preserve concurrentGpuTasks overrides plugin >= 25.06 drop") {
  val output = runConcurrentGpuTasksScenario(
    Seq("rapids-4-spark_2.12-25.08.0.jar"),
    preserveProps = List("spark.rapids.sql.concurrentGpuTasks"))
  assert(output.contains("spark.rapids.sql.concurrentGpuTasks"),
    s"Expected preserved concurrentGpuTasks to be present, got:\n$output")
}

Comment on lines +647 to +651
private def isConcurrentGpuTasksAutoTunedByPlugin: Boolean = {
  getRapidsPluginJarVersion.exists { jarVer =>
    ToolUtils.compareVersions(jarVer, autoTunerHelper.pluginVersionAutoConcurrentGpuTasks) >= 0
  }
}

P2 compareVersions returns 0 on failure, silently skipping recommendation

ToolUtils.compareVersions catches any exception and returns 0 (treating the two versions as equal). Because the check is >= 0, a comparison failure is interpreted as "version is at the threshold", and the recommendation is incorrectly dropped. The pluginVersionAutoConcurrentGpuTasks constant is well-formed so this is very unlikely in practice, but a defensive fallback would make the intent explicit and prevent silent misbehavior on unusual version strings.
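One possible shape for such a defensive fallback is to make a parse failure distinguishable from equality, so that an unparseable version never triggers the skip. The sketch below is an assumption-laden illustration: `SafeVersionCompare`, `compare`, and `isAtLeast` are hypothetical names, and it does not reflect how `ToolUtils.compareVersions` is actually implemented.

```scala
import scala.util.Try

object SafeVersionCompare {
  // Parses "25.06.0" into Seq(25, 6, 0); returns None on any non-numeric component.
  private def parse(version: String): Option[Seq[Int]] =
    Try(version.split('.').toSeq.map(_.toInt)).toOption.filter(_.nonEmpty)

  // Some(negative / zero / positive) when both versions parse, None otherwise.
  def compare(v1: String, v2: String): Option[Int] =
    for {
      a <- parse(v1)
      b <- parse(v2)
    } yield {
      a.zipAll(b, 0, 0)
        .collectFirst { case (x, y) if x != y => Integer.compare(x, y) }
        .getOrElse(0)
    }

  // A failed comparison yields false, so the existing recommendation is kept.
  def isAtLeast(version: String, threshold: String): Boolean =
    compare(version, threshold).exists(_ >= 0)
}
```

With this shape, a version string such as "not-a-version" falls back to the existing recommendation logic instead of being treated as equal to the threshold.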


Labels

autotuner core_tools Scope the core module (scala)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Drop AutoTuner recommendation for concurrentGpuTasks for apps using plugin >= 25.06

2 participants