
[gpu] Enhance driver installer and update README for custom images, versions, and performance #1320


Open: wants to merge 3 commits into master

Conversation


@cjac cjac commented May 4, 2025

This pull request delivers substantial improvements to the GPU initialization action (gpu/install_gpu_driver.sh) and a complete overhaul of its documentation (gpu/README.md). Key enhancements include robust support for custom Dataproc image creation, expanded OS and software version compatibility, new metadata parameters for greater control, and significant performance/caching optimizations.

The analysis of script changes, generation of the initial PR description, and the iterative refinement of the README.md were developed with the assistance of Gemini Advanced (May 2025).

Core Script Enhancements (install_gpu_driver.sh):

  • Custom Image Deferred Configuration (fixes #1303, "[custom-images] install_gpu_driver.sh does not configure spark when run as startup script with custom-images #110"):
    • The script now intelligently detects if it's running during a custom image build (via invocation-type=custom-images metadata).
    • If so, Hadoop/Spark-related configurations are deferred to the first boot of instances created from the custom image. This is managed by a new systemd service (dataproc-gpu-config.service), ensuring all base Dataproc components are present before configuration.
  • Expanded Compatibility:
    • Broader support for CUDA versions (up to 12.6), corresponding NVIDIA drivers, cuDNN, and NCCL, leveraging an internal, frequently updated version compatibility matrix.
    • Enhanced support for Dataproc 2.0+ images on Debian 10-12, Ubuntu 18.04-22.04, and Rocky Linux 8-9.
  • New & Enhanced Metadata Parameters:
    • cuda-url, gpu-driver-url: Allow users to specify direct HTTP/HTTPS URLs to custom CUDA toolkit and NVIDIA driver .run files, overriding the script's default selection logic.
    • include-pytorch, gpu-conda-env: Provide options for installing PyTorch and related ML libraries within a specified Conda environment.
    • Secure Boot Parameters: private_secret_name, public_secret_name, secret_project, secret_version, modulus_md5sum for signing kernel modules when Secure Boot is enabled.
  • Installation & Configuration Logic Improvements:
    • Prioritizes NVIDIA .run files or building drivers from source (e.g., NVIDIA's open-gpu-kernel-modules) for improved reliability and version flexibility.
    • Builds NCCL from source to ensure optimal compatibility.
    • Implements robust GCS caching for downloaded artifacts (drivers, CUDA, Conda environments) and compiled components (kernel modules, NCCL) in the dataproc-temp-bucket. This significantly speeds up subsequent runs and cluster provisioning.
    • Installs and configures the NVIDIA Container Toolkit for GPU-enabled container workloads.
    • Hardens SSHD configuration by default.
    • Improved management of APT sources, GPG keys, and backports for Debian-based systems.
    • Optimized use of RAMdisk for temporary files on instances with sufficient memory.
    • Updated GPU monitoring agent installation to use a dedicated Python virtual environment.
  • Performance and Security:
    • Cache Pre-warming: The first run on a new configuration (especially on smaller nodes) can be lengthy due to source compilation. Pre-warming the GCS cache on a larger, single-node instance is highly recommended to reduce subsequent cluster setup times drastically (e.g., from ~150 min to ~15-20 min in some cases for the init action itself).
    • Reduced Attack Surface: When build artifacts are successfully retrieved from the cache, the script often bypasses the need to install build tools (gcc, kernel-devel, etc.) on cluster nodes, enhancing security.
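
The custom-image detection described in the first bullet above can be sketched as follows. This is a minimal sketch: the helper name `get_metadata_attribute` and its fallback default are illustrative assumptions, not verbatim from `install_gpu_driver.sh`.

```shell
#!/bin/bash
# Hypothetical sketch of invocation-type detection. Off-GCE (or when the
# attribute is unset) the lookup falls back to a default value.

function get_metadata_attribute() {
  local attribute="$1" default="${2:-}"
  # Query the GCE metadata server; fall back to the default when unreachable.
  curl -fs -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${attribute}" \
    2>/dev/null || echo -n "${default}"
}

invocation_type="$(get_metadata_attribute invocation-type standard)"
if [[ "${invocation_type}" == "custom-images" ]]; then
  IS_CUSTOM_IMAGE_BUILD=true   # image build: defer Hadoop/Spark config to first boot
else
  IS_CUSTOM_IMAGE_BUILD=false  # live cluster: configure immediately
fi
echo "IS_CUSTOM_IMAGE_BUILD=${IS_CUSTOM_IMAGE_BUILD}"
```

Image-building tools such as `generate_custom_image.py` set the `invocation-type=custom-images` metadata; end users creating clusters normally do not.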

Comprehensive README.md Overhaul (fixes #1267):

  • Complete Rewrite for Accuracy and Clarity: The documentation has been entirely updated to accurately reflect all current functionalities, metadata parameters, and best practices, using the README version (md5sum 2daece9a7841cc4f5a0997fecf68cbd7) as the structural and stylistic baseline.
  • Detailed Usage and Configuration:
    • New "Default versions" section clarifies default CUDA selection logic (referencing NVIDIA's support matrix as a basis for the script's internal logic) and provides an updated table of example tested configurations and supported OS.
    • Revised gcloud examples for creating clusters, including MIG setup and using custom driver/CUDA URLs.
  • Custom Image Creation Guidance: A new, clear section explains the invocation-type=custom-images mechanism for use with image building tools like generate_custom_image.py.
  • Exhaustive Metadata Parameter List: The "Metadata Parameters" section now comprehensively details all available options, their purpose, defaults, and usage notes (e.g., cuda-url/gpu-driver-url expect HTTP/HTTPS).
  • In-depth Feature Explanations:
    • Expanded "Loading Built Kernel Module & Secure Boot" section detailing MOK management with GCP Secret Manager.
    • Updated information on "GPU Scheduling in YARN," "cuDNN" installation, and NVIDIA Container Toolkit.
  • Performance and Caching Documentation: The "Important notes" section now includes detailed advice on cache pre-warming, expected run times, and the security benefits of caching.
  • Modernized Content: Removed outdated information and examples pertaining to older Dataproc/CUDA versions and previous GPU agent behaviors. The "Report Metrics" and "Troubleshooting" sections have been updated for current agent functionality.
  • Maintained Structure and Formatting: Where possible, the line wrapping and structural style of the baseline README were preserved to ensure consistency.
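
As a rough illustration of the `cuda-url`/`gpu-driver-url` metadata covered in the README updates above, the following assembles (but does not run) a cluster-creation command. The cluster name, region, and both URLs are placeholders to substitute with real values.

```shell
#!/bin/bash
# Hedged example: builds the gcloud command line and prints it rather than
# executing it, so the flag layout can be inspected anywhere.
set -euo pipefail

CLUSTER=my-gpu-cluster                                               # placeholder
REGION=us-central1                                                   # placeholder
CUDA_URL="https://example.com/installers/cuda_12.6.3_linux.run"      # placeholder
DRIVER_URL="https://example.com/installers/NVIDIA-Linux-x86_64.run"  # placeholder

cmd=(gcloud dataproc clusters create "${CLUSTER}"
  --region "${REGION}"
  --worker-accelerator "type=nvidia-tesla-t4,count=1"
  --initialization-actions "gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh"
  --metadata "cuda-url=${CUDA_URL},gpu-driver-url=${DRIVER_URL}")

printf '%s ' "${cmd[@]}"; echo
```

Note the regionalized initialization-actions bucket path, matching the revised `gcloud` examples described above; both URL parameters expect HTTP/HTTPS endpoints fetchable with `curl`.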

This PR significantly modernizes the GPU initialization action, making it more robust, flexible, configurable, and performant, especially for users building custom images or requiring specific software versions. The updated documentation provides clear and comprehensive guidance for all users.

… script, primarily to **resolve the issue of Spark and Hadoop configurations failing during custom image creation**, as detailed in [GitHub Issue GoogleCloudDataproc#1303](GoogleCloudDataproc#1303).

The core problem was that the script attempted to modify configuration files (like `spark-defaults.conf`) before they were created by `bdutil` during the image customization process. This PR implements the proposed solution by deferring these configuration steps until the first boot of the instance.

* **Deferred Configuration for Custom Images:**
    * The script now detects if it's running in a custom image build context by checking the `invocation-type` metadata attribute. This is stored in the `IS_CUSTOM_IMAGE_BUILD` variable.
    * When `IS_CUSTOM_IMAGE_BUILD` is true, critical Hadoop and Spark configuration steps are no longer executed immediately. Instead, a new systemd service (`dataproc-gpu-config.service`) is generated and enabled.
    * This service is responsible for running a newly created script (`/usr/local/sbin/apply-dataproc-gpu-config.sh`) on the instance's first boot. This generated script now contains all the necessary logic for Hadoop/Spark/GPU configuration (moved into a `run_hadoop_spark_config` function).
    * This deferral mechanism **explicitly solves issue GoogleCloudDataproc#1303** by ensuring that configurations are applied only after the Dataproc environment, including necessary configuration files, has been fully initialized.
* **Script Structure for Deferred Execution:**
    * The `main` function has been refactored. It now orchestrates the installation of drivers and core components as before. However, for Hadoop/Spark configurations, it either executes the `apply-dataproc-gpu-config.sh` script directly (if not a custom image build) or enables the systemd service to run it on first boot.
    * The `create_deferred_config_files` function is responsible for generating the systemd service unit and the `apply-dataproc-gpu-config.sh` script. This script is carefully constructed to include all necessary helper functions and variables from the main `install_gpu_driver.sh` script to run independently.
* **Re-evaluation of Environment in Deferred Script:** The deferred script (`apply-dataproc-gpu-config.sh`) re-evaluates critical environment variables like `ROLE`, `SPARK_VERSION`, `gpu_count`, and `IS_MIG_ENABLED` at the time of its execution (first boot) to ensure accuracy.
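
A minimal sketch of the deferral mechanism described above, assuming a one-shot systemd unit that runs the generated script on first boot and then disables itself. The unit contents here are illustrative, not verbatim from `create_deferred_config_files`.

```shell
#!/bin/bash
# Hypothetical sketch of the systemd-unit generation step. The demo writes
# to a scratch directory; a real node would target /etc/systemd/system and
# follow up with `systemctl daemon-reload && systemctl enable ...`.
set -euo pipefail

create_deferred_config_files() {
  local unit_dir="$1"       # /etc/systemd/system on a real node
  local config_script="$2"  # e.g. /usr/local/sbin/apply-dataproc-gpu-config.sh
  cat > "${unit_dir}/dataproc-gpu-config.service" <<EOF
[Unit]
Description=Apply deferred Dataproc GPU/Spark configuration on first boot
After=network-online.target

[Service]
Type=oneshot
ExecStart=${config_script}
# Disable the unit after a successful run so config is applied exactly once.
ExecStartPost=/bin/systemctl disable dataproc-gpu-config.service
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}

demo_dir="$(mktemp -d)"
create_deferred_config_files "${demo_dir}" /usr/local/sbin/apply-dataproc-gpu-config.sh
grep 'ExecStart=' "${demo_dir}/dataproc-gpu-config.service"
```

The `Type=oneshot` plus self-disabling `ExecStartPost` pattern ensures the deferred configuration runs exactly once, after the Dataproc environment has been fully initialized.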

* **CUDA and Driver Updates:**
    * Added support for Dataproc image version "2.3", defaulting to CUDA version "12.6.3".
    * Improved robustness in `install_build_dependencies` for Rocky Linux with fallbacks for kernel package downloads.
* **Error Handling and Robustness:**
    * Several commands, like `gsutil rm`, `pip cache purge`, and `wget` in `Workspace_mig_scripts`, have improved error handling or are wrapped in `execute_with_retries`.
    * Suppressed benign errors from `du` commands during cleanup.
    * Zeroing of free disk space is now more robust and conditional on custom image builds.
* **Configuration and Installation Improvements:**
    * Dynamically sets `conda_root_path` based on `DATAPROC_IMAGE_VERSION`.
    * Corrected GPG key handling for the NVIDIA Container Toolkit repository on Debian systems.
    * Ensures `python3-venv` is installed for the GPU agent on newer Debian-based images.
    * Streamlined several configuration functions by removing redundant GPU count checks.
    * Ensures RAPIDS properties are added to `spark-defaults.conf` idempotently.
    * The `check_secure_boot` function now handles cases where `mokutil` might not be present and provides a clearer error for missing signing material.
    * The script entry point and preparation steps (`prepare_to_install`) are more clearly defined.
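
The idempotent handling of RAPIDS properties mentioned above can be illustrated with a small helper: append a property to `spark-defaults.conf` only if its key is not already present. The helper name and the property used are examples, not the exact set the script writes.

```shell
#!/bin/bash
# Minimal sketch of idempotent property addition, assuming the standard
# space-separated "key value" format of spark-defaults.conf.
set -euo pipefail

add_property_idempotently() {
  local conf_file="$1" key="$2" value="$3"
  # Append only when no line already defines this key.
  grep -q "^${key} " "${conf_file}" || echo "${key} ${value}" >> "${conf_file}"
}

conf="$(mktemp)"
add_property_idempotently "${conf}" spark.plugins com.nvidia.spark.SQLPlugin
add_property_idempotently "${conf}" spark.plugins com.nvidia.spark.SQLPlugin  # no-op
grep -c '^spark.plugins ' "${conf}"   # prints 1
```

Running the init action twice (or on a node built from a custom image that already ran it) therefore cannot duplicate configuration lines.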

By implementing a deferred configuration mechanism for custom image builds, this pull request directly addresses and **resolves the core problem outlined in GitHub issue GoogleCloudDataproc#1303**, ensuring that GPU-related Hadoop and Spark configurations are applied reliably.

cjac commented May 4, 2025

/gcbrun

@cjac cjac changed the title [gpu] Defer GPU config on custom images to resolve #1303 [gpu] Enhance driver installer and update README for custom images, versions, and performance May 4, 2025
This commit comprehensively updates gpu/README.md to align with the
current features, metadata, and behavior of install_gpu_driver.sh.

Key updates to the README include:

- Default Versions & Configurations:
  - Clarified that the script's internal version matrix is based on
    NVIDIA's guidance (e.g., Deep Learning Frameworks Support Matrix).
  - Updated example default CUDA versions for different Dataproc
    image series (2.0, 2.1, 2.2+).
  - Expanded the table of "Example Tested Configurations" with more
    recent and relevant CUDA/Driver/cuDNN/NCCL versions and their
    tested Dataproc image compatibility.
  - Updated the list of "Supported Operating Systems."

- Usage Examples:
  - Revised `gcloud` examples for clarity, using current best practices
    (regionalized bucket paths, common GPU types).
  - Added a new example demonstrating the use of `cuda-url` and
    `gpu-driver-url` with HTTP/HTTPS URLs.
  - Updated the MIG (Multi-Instance GPU) example to correctly show
    `install_gpu_driver.sh` for base drivers and `mig.sh` (via
    `dataproc:startup.script.uri`) for MIG-specific setup.

- Custom Image Creation:
  - Clarified the use of `invocation-type=custom-images` metadata,
    emphasizing it's set by image building tools (like
    `generate_custom_image.py`) and not by end-users creating
    clusters from scratch.
  - Provided a simplified example for `generate_custom_image.py`.

- Feature Documentation:
  - Updated "GPU Scheduling in YARN" to reflect current configurations,
    including the RAPIDS Spark plugin.
  - Revised the "cuDNN" section for clarity on version selection and
    installation methods.
  - Significantly expanded "Loading Built Kernel Module & Secure Boot"
    with details on MOK key management via GCP Secret Manager, the role
    of `GoogleCloudDataproc/custom-images/examples/secure-boot/create-key-pair.sh`,
    and the `--no-shielded-secure-boot` workaround.

- Metadata Parameters:
  - Ensured the list is comprehensive and descriptions are accurate for
    all current parameters, including:
    - `cuda-url` and `gpu-driver-url` (clarifying they expect HTTP/HTTPS
      URLs for `curl` fetching).
    - `include-pytorch` and `gpu-conda-env`.
    - `container-runtime`.
    - Full set of Secure Boot signing parameters.
  - Corrected default values where necessary (e.g., `install-gpu-agent`
    is now `true` by default).

- Verification, Reporting, and Troubleshooting:
  - Updated verification commands.
  - Clarified that the "Report Metrics" section now refers to the
    automated agent (based on ml-on-gcp code) and that
    `create_gpu_metrics.py` is no longer used by this init action.
  - Revised troubleshooting tips to be more relevant to current issues.

- Important Notes:
  - Added detailed "Performance & Caching" subsection explaining:
    - The GCS caching mechanism (`dataproc-temp-bucket`).
    - Potential long first-run times (up to 150 mins on small nodes) if
      compiling from source.
    - The recommendation and benefits of "pre-warming" the cache on a
      larger instance (reducing init action time to ~12-20 mins in some
      cases).
    - The security benefit of reduced attack surface when using cached
      artifacts (as build tools may not be needed).
  - Updated notes on SSHD hardening and APT source management.
  - Confirmed primary support for Dataproc 2.0+ images.

- Formatting and Style:
  - Maintained the overall structure and line-wrapping style (aiming
    for ~80 columns) of the provided baseline README (md5sum
    2daece9a7841cc4f5a0997fecf68cbd7) where feasible, while ensuring
    clarity and readability of the new and updated content.
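
The cache-or-build pattern from the Performance & Caching notes above can be sketched as follows. The URI layout and function name are hypothetical, and a stub stands in for `gsutil` so the control flow is runnable anywhere; on a Dataproc node the bucket would be the cluster's `dataproc-temp-bucket` and real `gsutil` calls would run.

```shell
#!/bin/bash
# Sketch of "check GCS cache, else build and publish". On a cache hit the
# node never needs build tools (gcc, kernel-devel, ...), which is the
# reduced-attack-surface benefit noted above.
set -euo pipefail

gsutil() { return 1; }  # stub: simulate a cache miss (no real GCS access here)

fetch_or_build() {
  local cache_uri="$1" local_path="$2"; shift 2
  if gsutil -q stat "${cache_uri}" && gsutil cp "${cache_uri}" "${local_path}"; then
    echo "cache hit: ${cache_uri}"
  else
    "$@"                                               # cache miss: run the build
    gsutil cp "${local_path}" "${cache_uri}" || true   # best-effort publish
    echo "built: ${local_path}"
  fi
}

artifact="$(mktemp -d)/nccl.tar.gz"
fetch_or_build "gs://my-temp-bucket/gpu-cache/nccl.tar.gz" "${artifact}" \
  touch "${artifact}"   # stand-in for the real NCCL source build
```

Pre-warming simply means running this once on a large single-node instance so every later cluster takes the fast `cache hit` branch.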
@cjac cjac requested a review from Deependra-Patel May 6, 2025 10:22

cjac commented May 6, 2025

/gcbrun

@cjac cjac requested review from singhravidutt and rrohanarora and removed request for singhravidutt May 20, 2025 23:58
Addressed issues related to Salesforce case #59134938