[gpu] Enhance driver installer and update README for custom images, versions, and performance #1320
Open: cjac wants to merge 3 commits into GoogleCloudDataproc:master from LLC-Technologies-Collier:gpu-20250503.
Conversation
… script, primarily to **resolve the issue of Spark and Hadoop configurations failing during custom image creation, as detailed in GoogleCloudDataproc#1303.** The core problem was that the script attempted to modify configuration files (like `spark-defaults.conf`) before they were created by `bdutil` during the image customization process. This PR implements the proposed solution by deferring these configuration steps until the first boot of the instance.

* **Deferred Configuration for Custom Images:**
  * The script now detects whether it is running in a custom image build context by checking the `invocation-type` metadata attribute. The result is stored in the `IS_CUSTOM_IMAGE_BUILD` variable.
  * When `IS_CUSTOM_IMAGE_BUILD` is true, critical Hadoop and Spark configuration steps are no longer executed immediately. Instead, a new systemd service (`dataproc-gpu-config.service`) is generated and enabled.
  * This service is responsible for running a newly created script (`/usr/local/sbin/apply-dataproc-gpu-config.sh`) on the instance's first boot. The generated script contains all the necessary logic for Hadoop/Spark/GPU configuration (moved into a `run_hadoop_spark_config` function).
  * This deferral mechanism **explicitly solves GoogleCloudDataproc#1303** by ensuring that configurations are applied only after the Dataproc environment, including the necessary configuration files, has been fully initialized.
* **Script Structure for Deferred Execution:**
  * The `main` function has been refactored. It still orchestrates the installation of drivers and core components as before; for Hadoop/Spark configuration, however, it either executes `apply-dataproc-gpu-config.sh` directly (when not building a custom image) or enables the systemd service to run it on first boot.
  * The `create_deferred_config_files` function is responsible for generating the systemd service unit and the `apply-dataproc-gpu-config.sh` script.
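The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `get_metadata_attribute` is a hypothetical helper standing in for the script's own metadata accessor, and the default value is an assumption.

```shell
#!/bin/bash
# Sketch: detect a custom-image build from GCE instance metadata.
# "invocation-type" is the attribute named in the PR; the helper and
# default below are illustrative assumptions.
get_metadata_attribute() {
  local attr="$1" default="${2:-}"
  # The GCE metadata server requires the Metadata-Flavor header.
  curl -fsS --max-time 5 -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${attr}" \
    2>/dev/null || echo -n "${default}"
}

invocation_type="$(get_metadata_attribute invocation-type default)"
if [[ "${invocation_type}" == "custom-images" ]]; then
  IS_CUSTOM_IMAGE_BUILD="true"
else
  IS_CUSTOM_IMAGE_BUILD="false"
fi
echo "IS_CUSTOM_IMAGE_BUILD=${IS_CUSTOM_IMAGE_BUILD}"
```

Off GCP (or when the attribute is unset), the `curl` call fails and the default is used, so the snippet degrades to the normal cluster-creation path.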
    This generated script is carefully constructed to include all necessary helper functions and variables from the main `install_gpu_driver.sh` script so that it can run independently.
* **Re-evaluation of Environment in Deferred Script:** The deferred script (`apply-dataproc-gpu-config.sh`) re-evaluates critical environment variables such as `ROLE`, `SPARK_VERSION`, `gpu_count`, and `IS_MIG_ENABLED` at the time of its execution (first boot) to ensure accuracy.
* **CUDA and Driver Updates:**
  * Added support for Dataproc image version "2.3", defaulting to CUDA version "12.6.3".
  * Improved robustness in `install_build_dependencies` for Rocky Linux, with fallbacks for kernel package downloads.
* **Error Handling and Robustness:**
  * Several commands, such as `gsutil rm`, `pip cache purge`, and `wget` in `Workspace_mig_scripts`, have improved error handling or are wrapped in `execute_with_retries`.
  * Benign errors from `du` commands during cleanup are suppressed.
  * Zeroing of free disk space is now more robust and conditional on custom image builds.
* **Configuration and Installation Improvements:**
  * `conda_root_path` is now set dynamically based on `DATAPROC_IMAGE_VERSION`.
  * Corrected GPG key handling for the NVIDIA Container Toolkit repository on Debian systems.
  * Ensures `python3-venv` is installed for the GPU agent on newer Debian-based images.
  * Streamlined several configuration functions by removing redundant GPU count checks.
  * Ensures RAPIDS properties are added to `spark-defaults.conf` idempotently.
  * The `check_secure_boot` function now handles cases where `mokutil` might not be present and provides a clearer error for missing signing material.
  * The script entry point and preparation steps (`prepare_to_install`) are more clearly defined.
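A retry wrapper of the kind referred to above might look like the following. This is a hedged sketch; the retry count and backoff in the real `execute_with_retries` may differ.

```shell
#!/bin/bash
# Sketch of a retry wrapper in the spirit of execute_with_retries.
# Retry count and backoff are assumptions for illustration.
execute_with_retries() {
  local -i attempt max_attempts=3
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@"; then return 0; fi
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    sleep 1  # fixed backoff for the sketch
  done
  return 1
}

# Example usage (hypothetical command):
# execute_with_retries apt-get -y -q install python3-venv
```

Wrapping flaky network-bound commands (`wget`, package installs) this way converts transient failures into bounded retries instead of aborting the whole init action.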
By implementing a deferred configuration mechanism for custom image builds, this pull request directly addresses and **resolves the core problem outlined in GitHub issue GoogleCloudDataproc#1303**, ensuring that GPU-related Hadoop and Spark configurations are applied reliably.
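The deferral mechanism can be sketched as below. For illustration the files are written under a temp directory; per the PR description, the real script targets `/usr/local/sbin` and a systemd unit named `dataproc-gpu-config.service`, then enables it with `systemctl`. The unit body here is an assumption about what a first-boot one-shot service would look like, and the deferred script's body is elided.

```shell
#!/bin/bash
# Sketch: generate the deferred config script and a one-shot systemd
# unit that runs it on first boot. Written to a temp dir so the sketch
# runs anywhere; a real installer would write to /etc/systemd/system
# and /usr/local/sbin, then: systemctl enable dataproc-gpu-config.service
stage="$(mktemp -d)"
deferred="${stage}/apply-dataproc-gpu-config.sh"
unit="${stage}/dataproc-gpu-config.service"

cat > "${deferred}" <<'EOF'
#!/bin/bash
# First boot: re-evaluate ROLE, SPARK_VERSION, gpu_count, IS_MIG_ENABLED,
# then apply the Hadoop/Spark/GPU configuration (run_hadoop_spark_config
# in the real script; body elided here).
EOF
chmod 0755 "${deferred}"

cat > "${unit}" <<EOF
[Unit]
Description=Apply deferred Dataproc GPU configuration on first boot
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/apply-dataproc-gpu-config.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

echo "generated: ${unit}"
```

`Type=oneshot` with `RemainAfterExit=yes` is the conventional systemd pattern for run-once configuration tasks, which matches the first-boot behavior the PR describes.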
/gcbrun
This commit comprehensively updates gpu/README.md to align with the current features, metadata, and behavior of install_gpu_driver.sh. Key updates to the README include:

- Default Versions & Configurations:
  - Clarified that the script's internal version matrix is based on NVIDIA's guidance (e.g., the Deep Learning Frameworks Support Matrix).
  - Updated example default CUDA versions for different Dataproc image series (2.0, 2.1, 2.2+).
  - Expanded the table of "Example Tested Configurations" with more recent and relevant CUDA/Driver/cuDNN/NCCL versions and their tested Dataproc image compatibility.
  - Updated the list of "Supported Operating Systems."
- Usage Examples:
  - Revised `gcloud` examples for clarity, using current best practices (regionalized bucket paths, common GPU types).
  - Added a new example demonstrating the use of `cuda-url` and `gpu-driver-url` with HTTP/HTTPS URLs.
  - Updated the MIG (Multi-Instance GPU) example to correctly show `install_gpu_driver.sh` for base drivers and `mig.sh` (via `dataproc:startup.script.uri`) for MIG-specific setup.
- Custom Image Creation:
  - Clarified the use of `invocation-type=custom-images` metadata, emphasizing that it is set by image building tools (like `generate_custom_image.py`) and not by end users creating clusters from scratch.
  - Provided a simplified example for `generate_custom_image.py`.
- Feature Documentation:
  - Updated "GPU Scheduling in YARN" to reflect current configurations, including the RAPIDS Spark plugin.
  - Revised the "cuDNN" section for clarity on version selection and installation methods.
  - Significantly expanded "Loading Built Kernel Module & Secure Boot" with details on MOK key management via GCP Secret Manager, the role of `GoogleCloudDataproc/custom-images/examples/secure-boot/create-key-pair.sh`, and the `--no-shielded-secure-boot` workaround.
- Metadata Parameters:
  - Ensured the list is comprehensive and descriptions are accurate for all current parameters, including:
    - `cuda-url` and `gpu-driver-url` (clarifying that they expect HTTP/HTTPS URLs for `curl` fetching).
    - `include-pytorch` and `gpu-conda-env`.
    - `container-runtime`.
    - The full set of Secure Boot signing parameters.
  - Corrected default values where necessary (e.g., `install-gpu-agent` is now `true` by default).
- Verification, Reporting, and Troubleshooting:
  - Updated verification commands.
  - Clarified that the "Report Metrics" section now refers to the automated agent (based on ml-on-gcp code) and that `create_gpu_metrics.py` is no longer used by this init action.
  - Revised troubleshooting tips to be more relevant to current issues.
- Important Notes:
  - Added a detailed "Performance & Caching" subsection explaining:
    - The GCS caching mechanism (`dataproc-temp-bucket`).
    - Potentially long first-run times (up to 150 minutes on small nodes) when compiling from source.
    - The recommendation and benefits of "pre-warming" the cache on a larger instance (reducing init action time to ~12-20 minutes in some cases).
    - The security benefit of a reduced attack surface when using cached artifacts (since build tools may not be needed).
  - Updated notes on SSHD hardening and APT source management.
  - Confirmed primary support for Dataproc 2.0+ images.
- Formatting and Style:
  - Maintained the overall structure and line-wrapping style (aiming for ~80 columns) of the provided baseline README (md5sum 2daece9a7841cc4f5a0997fecf68cbd7) where feasible, while ensuring clarity and readability of the new and updated content.
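A cluster-creation command using these metadata parameters might look like the sketch below. The cluster name, region, and URLs are placeholders, and the regionalized bucket path follows the convention the README describes; the command is printed rather than executed so the sketch runs without `gcloud` or GCP credentials.

```shell
#!/bin/bash
# Sketch: gcloud invocation using the metadata parameters documented in
# the README update. Names and URLs below are placeholders, not values
# from this PR.
CREATE_CMD=(
  gcloud dataproc clusters create gpu-cluster  # placeholder cluster name
  --region us-central1
  --worker-accelerator type=nvidia-tesla-t4,count=1
  --initialization-actions
  "gs://goog-dataproc-initialization-actions-us-central1/gpu/install_gpu_driver.sh"
  --metadata
  "install-gpu-agent=true,cuda-url=https://example.com/cuda_installer.run,gpu-driver-url=https://example.com/driver_installer.run"
)

# Print the command for review instead of running it.
printf '%s ' "${CREATE_CMD[@]}"; echo
```

Per the PR text, `cuda-url` and `gpu-driver-url` expect HTTP/HTTPS URLs to `.run` installers that the script fetches with `curl`, overriding its default version-selection logic.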
/gcbrun
Addressed issues related to salesforce case #59134938
This pull request delivers substantial improvements to the GPU initialization action (`gpu/install_gpu_driver.sh`) and a complete overhaul of its documentation (`gpu/README.md`). Key enhancements include robust support for custom Dataproc image creation, expanded OS and software version compatibility, new metadata parameters for greater control, and significant performance/caching optimizations. The analysis of script changes, generation of the initial PR description, and the iterative refinement of the README.md were developed with the assistance of Gemini Advanced (May 2025).

Core Script Enhancements (`install_gpu_driver.sh`):

- Detects custom image builds (via the `invocation-type=custom-images` metadata) and defers Hadoop/Spark configuration to a first-boot systemd service (`dataproc-gpu-config.service`), ensuring all base Dataproc components are present before configuration.
- `cuda-url`, `gpu-driver-url`: allow users to specify direct HTTP/HTTPS URLs to custom CUDA toolkit and NVIDIA driver `.run` files, overriding the script's default selection logic.
- `include-pytorch`, `gpu-conda-env`: provide options for installing PyTorch and related ML libraries within a specified Conda environment.
- `private_secret_name`, `public_secret_name`, `secret_project`, `secret_version`, `modulus_md5sum`: used for signing kernel modules when Secure Boot is enabled.
- Supports installing drivers from `.run` files or building drivers from source (e.g., NVIDIA's open-gpu-kernel-modules) for improved reliability and version flexibility.
- Caches build artifacts in the cluster's `dataproc-temp-bucket`. This significantly speeds up subsequent runs and cluster provisioning.

Comprehensive README.md Overhaul (fixes #1267):

- Uses the provided baseline README (md5sum `2daece9a7841cc4f5a0997fecf68cbd7`) as the structural and stylistic baseline.
- Updates `gcloud` examples for creating clusters, including MIG setup and using custom driver/CUDA URLs.
- Documents the `invocation-type=custom-images` mechanism for use with image building tools like `generate_custom_image.py`.
- Clarifies the metadata parameters (e.g., that `cuda-url`/`gpu-driver-url` expect HTTP/HTTPS URLs).

This PR significantly modernizes the GPU initialization action, making it more robust, flexible, configurable, and performant, especially for users building custom images or requiring specific software versions. The updated documentation provides clear and comprehensive guidance for all users.
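The Secure Boot handling mentioned above (tolerating a missing `mokutil` and failing clearly when signing material is absent) could be sketched like this. The function and variable handling are illustrative assumptions, not the script's actual implementation.

```shell
#!/bin/bash
# Sketch: secure-boot probe in the spirit of check_secure_boot.
# Tolerates a missing mokutil; errors clearly if Secure Boot is on but
# signing metadata (private_secret_name/public_secret_name, per the PR)
# is absent. Structure is an assumption for illustration.
secure_boot_enabled() {
  if ! command -v mokutil >/dev/null 2>&1; then
    echo "mokutil not found; assuming Secure Boot is disabled" >&2
    return 1
  fi
  mokutil --sb-state 2>/dev/null | grep -q "SecureBoot enabled"
}

if secure_boot_enabled; then
  if [[ -z "${private_secret_name:-}" || -z "${public_secret_name:-}" ]]; then
    echo "ERROR: Secure Boot is enabled but signing material is missing." >&2
    echo "Provide the Secure Boot signing metadata parameters." >&2
    exit 1
  fi
fi
echo "secure boot check complete"
```

Guarding on `command -v` keeps the check safe on minimal images where `mokutil` is not installed, which matches the robustness goal described in the PR.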