Drop AutoTuner recommendation for concurrentGpuTasks for apps using plugin >= 25.06#2090
Drop AutoTuner recommendation for concurrentGpuTasks for apps using plugin >= 25.06#2090parthosa wants to merge 1 commit intoNVIDIA:devfrom
Conversation
Starting with plugin 25.06, the RAPIDS plugin auto-tunes the number of concurrent GPU tasks based on memory usage (NVIDIA/spark-rapids#12374), so the AutoTuner should stop recommending `spark.rapids.sql.concurrentGpuTasks` for apps using that plugin version or later. Target cluster `enforced` and `preserve` overrides still take precedence. Fixes NVIDIA#2089 Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
03ebd43 to
1286dcc
Compare
Greptile SummaryThis PR stops the AutoTuner from recommending Confidence Score: 4/5Safe to merge; all findings are P2 suggestions with no impact on the main happy path. The core logic is correct and well-tested for the primary cases. Two P2 observations: compareVersions returning 0 on failure could theoretically cause an incorrect skip (extremely unlikely with a hardcoded constant), and the preserve branch of isPropertyUserOverridden lacks a dedicated test. No files require special attention; both findings are minor improvements. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[calculateClusterLevelRecommendations] --> B{isPropertyUserOverridden\nconcurrentGpuTasks?}
B -- enforced or preserve --> C[appendRecommendation\nwith enforced/calculated value]
B -- no override --> D{isConcurrentGpuTasksAutoTunedByPlugin?}
D --> E[getRapidsPluginJarVersion]
E --> F{Unique version\nextracted?}
F -- None / multiple distinct --> G[return false]
F -- Some version --> H{compareVersions >= 25.06.0?}
H -- yes --> I[return true]
H -- no --> G
G --> C
I --> J[skippedRecommendations += concurrentGpuTasks\nno recommendation emitted]
Reviews (1): Last reviewed commit: "Drop concurrentGpuTasks recommendation f..." | Re-trigger Greptile |
| enforcedProps = Map("spark.rapids.sql.concurrentGpuTasks" -> "4")) | ||
| assert(output.contains("spark.rapids.sql.concurrentGpuTasks=4"), | ||
| s"Expected enforced concurrentGpuTasks=4 to be present, got:\n$output") | ||
| } | ||
|
|
||
| test("No recommendation when the jar pluginJar is up-to-date") { |
There was a problem hiding this comment.
Missing test for
preserve override path
isPropertyUserOverridden has two branches — getUserEnforcedSparkProperty (enforced) and isPropertyPreserved (preserve) — but only the enforced branch is exercised by the new tests. The runConcurrentGpuTasksScenario helper already accepts preserveProps, so a fifth test can close this gap:
test("Target cluster preserve concurrentGpuTasks overrides plugin >= 25.06 drop") {
val output = runConcurrentGpuTasksScenario(
Seq("rapids-4-spark_2.12-25.08.0.jar"),
preserveProps = List("spark.rapids.sql.concurrentGpuTasks"))
assert(output.contains("spark.rapids.sql.concurrentGpuTasks"),
s"Expected preserved concurrentGpuTasks to be present, got:\n$output")
}| private def isConcurrentGpuTasksAutoTunedByPlugin: Boolean = { | ||
| getRapidsPluginJarVersion.exists { jarVer => | ||
| ToolUtils.compareVersions(jarVer, autoTunerHelper.pluginVersionAutoConcurrentGpuTasks) >= 0 | ||
| } | ||
| } |
There was a problem hiding this comment.
compareVersions returns 0 on failure, silently skipping recommendation
ToolUtils.compareVersions catches any exception and returns 0 (treating the two versions as equal). Because the check is >= 0, a comparison failure is interpreted as "version is at the threshold", and the recommendation is incorrectly dropped. The pluginVersionAutoConcurrentGpuTasks constant is well-formed so this is very unlikely in practice, but a defensive fallback would make the intent explicit and prevent silent misbehavior on unusual version strings.
Fixes #2089
Changes
Starting with plugin 25.06, the RAPIDS plugin auto-tunes the number of concurrent GPU tasks based on memory usage (NVIDIA/spark-rapids#12374). The AutoTuner should no longer recommend
spark.rapids.sql.concurrentGpuTasksfor apps using that plugin version or later.Logic (in
AutoTuner.calculateClusterLevelRecommendations):appInfoProvider.getRapidsJarsusing the existingpluginJarRegEx.>= 25.06.0, skip the recommendation by adding the key toskippedRecommendations. This also suppresses the "was not set" missing comment.enforcedandpreserveoverrides take precedence — recommendation is still emitted in those cases.New helper:
Platform.isPropertyUserOverridden(key)— returns true if the property is inenforcedorpreserve. Lives next togetUserEnforcedSparkProperty/isPropertyPreserved/isPropertyExcluded.Cleanup: dropped an unnecessary
.toIntoncalcGpuConcTasks()—appendRecommendationalready has aLongoverload.Testing
ProfilingAutoTunerSuite— 4 new tests:spark.rapids.sql.concurrentGpuTasksfor plugin 25.06.0enforcedvalue wins over the drop logicFull core test suite (669 tests across 29 suites) passes.