8 changes: 4 additions & 4 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -33,18 +33,18 @@ The signoff means you certify the below (from [developercertificate.org](https:/

```
Developer Certificate of Origin
Version 1.1
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1
Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

6 changes: 3 additions & 3 deletions LICENSE
@@ -4,10 +4,10 @@

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.
1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2018 NVIDIA Corporation
Copyright 2018 NVIDIA Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
12 changes: 6 additions & 6 deletions README.md
@@ -6,7 +6,7 @@ You can download the latest version of RAPIDS Accelerator [here](https://nvidia.
This repo contains examples and applications that showcase the performance and benefits of using
RAPIDS Accelerator in data processing and machine learning pipelines.
There are broadly five categories of examples in this repo:
1. [SQL/Dataframe](./examples/SQL+DF-Examples)
1. [SQL/Dataframe](./examples/SQL+DF-Examples)
2. [Spark XGBoost](./examples/XGBoost-Examples)
3. [Machine Learning/Deep Learning](./examples/ML+DL-Examples)
4. [RAPIDS UDF](./examples/UDF-Examples)
@@ -18,22 +18,22 @@ Here is the list of notebooks in this repo:

| | Category | Notebook Name | Description
| ------------- | ------------- | ------------- | -------------
| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins with up to 20x performance benefits
| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins with up to 20x performance benefits
| 2 | SQL/DF | Customer Churn | Data federation for modeling customer Churn with a sample telco customer data
| 3 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
| 6 | ML/DL | PCA | [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) based PCA example to train and transform with a synthetic dataset
| 7 | ML/DL | DL Inference | 11 notebooks demonstrating distributed model inference on Spark using the `predict_batch_udf` across various frameworks: PyTorch, HuggingFace, and TensorFlow
| 7 | ML/DL | DL Inference | 11 notebooks demonstrating distributed model inference on Spark using the `predict_batch_udf` across various frameworks: PyTorch, HuggingFace, and TensorFlow

Here is the list of Apache Spark applications (Scala and PySpark) in this repo that
can be built to run on GPUs with RAPIDS Accelerator:

| | Category | Application Name | Description
| ------------- | ------------- | ------------- | -------------
| 1 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
| 1 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
| 4 | ML/DL | PCA | [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) based PCA example to train and transform with a synthetic dataset
| 5 | UDF | URL Decode | Decodes URL-encoded strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
| 6 | UDF | URL Encode | URL-encodes strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
Binary file modified datasets/agaricus-small.tar.gz
Binary file not shown.
Binary file modified datasets/criteo-small.tar.gz
Binary file not shown.
Binary file modified datasets/cuspatial_data.tar.gz
Binary file not shown.
Binary file modified datasets/customer-churn.tar.gz
Binary file not shown.
Binary file modified datasets/taxi-small.tar.gz
Binary file not shown.
Binary file modified datasets/tpcds-small.tar.gz
Binary file not shown.
14 changes: 7 additions & 7 deletions dockerfile/Dockerfile
@@ -1,4 +1,4 @@
# Copyright (c) 2019-2023, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2019-2023, NVIDIA CORPORATION. All rights reserved.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
@@ -15,13 +15,13 @@
# limitations under the License.
#

FROM nvidia/cuda:11.8.0-devel-ubuntu18.04
ARG spark_uid=185
FROM nvidia/cuda:11.8.0-devel-ubuntu18.04
ARG spark_uid=185

# Install java dependencies
RUN apt-get update && apt-get install -y --no-install-recommends openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
@@ -43,7 +43,7 @@ RUN set -ex && \

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends apt-utils \
&& apt-get install -y --no-install-recommends python libgomp1 \
&& apt-get install -y --no-install-recommends python libgomp1 \
&& rm -rf /var/lib/apt/lists/*

COPY jars /opt/spark/jars
@@ -59,7 +59,7 @@ ENV SPARK_HOME /opt/spark
WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

ENV TINI_VERSION v0.18.0
ENV TINI_VERSION v0.18.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /sbin/tini
RUN chmod +rx /sbin/tini

6 changes: 3 additions & 3 deletions dockerfile/gpu_executor_template.yaml
@@ -1,4 +1,4 @@
# Copyright (c) 2024, NVIDIA CORPORATION.
# Copyright (c) 2024-2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -12,12 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v1
apiVersion: v1
kind: Pod
spec:
containers:
- name: executor
resources:
limits:
nvidia.com/gpu: 1
nvidia.com/gpu: 1
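For context, Spark on Kubernetes picks up an executor pod template like the one above through a submit-time property. A hedged sketch follows; the property name is standard in Spark 3.x, but the file path and the rest of the command line are placeholders, not taken from this repo:

```shell
# Hypothetical submit fragment; point the template path at this repo's
# dockerfile/gpu_executor_template.yaml on the submitting machine.
echo 'spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --conf spark.kubernetes.executor.podTemplateFile=/path/to/gpu_executor_template.yaml \
  ...'
```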

54 changes: 27 additions & 27 deletions docs/get-started/xgboost-examples/csp/aws/ec2.md
@@ -8,29 +8,29 @@ For more details of AWS EC2 and get started, please check the [AWS document](htt

Go to AWS Management Console select a region, e.g. Oregon, and click EC2 service.

### Step 1: Launch New Instance
### Step 1: Launch New Instance

Click "Launch instance" at the EC2 Management Console, and select "Launch instance".

![Step 1: Launch New Instance](pics/ec2_step1.png)
![Step 1: Launch New Instance](pics/ec2_step1.png)

### Step 2: Configure Instance

#### Step 2.1: Choose an Amazon Machine Image(AMI)
#### Step 2.1: Choose an Amazon Machine Image(AMI)

Search for "deep learning base ami", choose "Deep Learning Base AMI (Ubuntu 18.04)". Click "Select".
Search for "deep learning base ami", choose "Deep Learning Base AMI (Ubuntu 18.04)". Click "Select".

![Step 2.1: Choose an Amazon Machine Image(AMI)](pics/ec2_step2-1.png)
![Step 2.1: Choose an Amazon Machine Image(AMI)](pics/ec2_step2-1.png)

#### Step 2.2: Choose an Instance Type

Choose the type "p3.2xlarge". Click "Next: Configure Instance Details" at the bottom right.

![Step 2.1: Choose an Instance Type](pics/ec2_step2-2.png)
![Step 2.1: Choose an Instance Type](pics/ec2_step2-2.png)

#### Step 2.3: Configure Instance Details

You do not need to change anything here; make sure "Number of instances" is 1. Click "Next: Add Storage" at the bottom right.

![Step 2.3: Configure Instance Details](pics/ec2_step2-3.png)

@@ -66,7 +66,7 @@ Return to "Instances | EC2 Management Console", where you can find your instance running.

## Launch EC2 and Configure Spark 3.2+

### Step 1: Launch EC2
### Step 1: Launch EC2

Copy the "Public DNS (IPv4)" of your instance, then use ssh with your private key to log in to the EC2 machine as user "ubuntu".
@@ -81,9 +81,9 @@ Download the Spark package and set environment variables.

``` bash
# download the spark
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar zxf spark-3.2.1-bin-hadoop3.2.tgz
export SPARK_HOME=/your/spark/spark-3.2.1-bin-hadoop3.2
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar zxf spark-3.2.1-bin-hadoop3.2.tgz
export SPARK_HOME=/your/spark/spark-3.2.1-bin-hadoop3.2
```

### Step 3: Download jars for S3A (optional)
@@ -93,25 +93,25 @@ The jars should be under $SPARK_HOME/jars

``` bash
cd $SPARK_HOME/jars
wget https://github.com/JodaOrg/joda-time/releases/download/v2.10.5/joda-time-2.10.5.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.687/aws-java-sdk-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.687/aws-java-sdk-core-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-dynamodb/1.11.687/aws-java-sdk-dynamodb-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.687/aws-java-sdk-s3-1.11.687.jar
wget https://github.com/JodaOrg/joda-time/releases/download/v2.10.5/joda-time-2.10.5.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.687/aws-java-sdk-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.687/aws-java-sdk-core-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-dynamodb/1.11.687/aws-java-sdk-dynamodb-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.687/aws-java-sdk-s3-1.11.687.jar
```

### Step 4: Start Spark Standalone

#### Step 4.1: Edit spark-default.conf
#### Step 4.1: Edit spark-defaults.conf

cd $SPARK_HOME/conf and edit spark-defaults.conf

By default, there is only spark-defaults.conf.template in $SPARK_HOME/conf; edit it and rename it to spark-defaults.conf.
You can find getGpusResources.sh in $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh

``` bash
spark.worker.resource.gpu.amount 1
spark.worker.resource.gpu.amount 1
spark.worker.resource.gpu.discoveryScript /path/to/getGpusResources.sh
```
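The discovery script that this config points at must print a single line of JSON describing the resource addresses. A minimal stub of the expected shape is sketched below; the real getGpusResources.sh shipped with Spark enumerates GPUs (via nvidia-smi) instead of hard-coding one:

```shell
# Stub GPU discovery script: emits the JSON object Spark's resource
# discovery expects. A real script would enumerate actual GPU indices.
echo '{"name": "gpu", "addresses": ["0"]}'
```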

@@ -128,7 +128,7 @@ $SPARK_HOME/sbin/start-slave.sh <master-spark-URL>

## Launch XGBoost-Spark examples on Spark 3.2+

### Step 1: Download Jars
### Step 1: Download Jars

Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md)

@@ -144,12 +144,12 @@ Create a run.sh script with the content below; make sure to change the paths in it

``` bash
#!/bin/bash
export SPARK_HOME=/your/path/to/spark-3.2.1-bin-hadoop3.2
export SPARK_HOME=/your/path/to/spark-3.2.1-bin-hadoop3.2

export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

export TOTAL_CORES=8
export NUM_EXECUTORS=1
export NUM_EXECUTORS=1
export NUM_EXECUTOR_CORES=$((${TOTAL_CORES}/${NUM_EXECUTORS}))

export S3A_CREDS_USR=your_aws_key
@@ -158,7 +158,7 @@ export S3A_CREDS_PSW=your_aws_secret

spark-submit --master spark://$HOSTNAME:7077 \
--deploy-mode client \
--driver-memory 10G \
--driver-memory 10G \
--executor-memory 22G \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.access.key=$S3A_CREDS_USR \
@@ -168,18 +168,18 @@ spark-submit --master spark://$HOSTNAME:7077 \
--conf spark.executor.cores=$NUM_EXECUTOR_CORES \
--conf spark.task.cpus=$NUM_EXECUTOR_CORES \
--conf spark.sql.files.maxPartitionBytes=4294967296 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.gpu.pooling.enabled=false \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=1 \
--class com.nvidia.spark.examples.mortgage.GPUMain \
${SAMPLE_JAR} \
-num_workers=${NUM_EXECUTORS} \
-format=csv \
-dataPath="train::your-train-data-path" \
-dataPath="trans::your-eval-data-path" \
-numRound=100 -max_depth=8 -nthread=$NUM_EXECUTOR_CORES -showFeatures=0 \
-numRound=100 -max_depth=8 -nthread=$NUM_EXECUTOR_CORES -showFeatures=0 \
-tree_method=gpu_hist
```
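The core-allocation arithmetic in the script above can be sanity-checked with a short sketch (the values mirror the exports in run.sh):

```python
# Mirrors NUM_EXECUTOR_CORES=$((TOTAL_CORES/NUM_EXECUTORS)) from run.sh.
total_cores = 8
num_executors = 1
num_executor_cores = total_cores // num_executors

# Because spark.task.cpus is set to the full executor core count, each
# executor runs one task at a time, which lines up with giving each task
# a whole GPU (spark.task.resource.gpu.amount=1).
print(num_executor_cores)  # -> 8
```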

Binary file modified docs/get-started/xgboost-examples/csp/aws/pics/ec2_step1.png
Binary file modified docs/get-started/xgboost-examples/csp/aws/pics/ec2_step2-1.png
Binary file modified docs/get-started/xgboost-examples/csp/aws/pics/ec2_step2-2.png
Binary file modified docs/get-started/xgboost-examples/csp/aws/pics/ec2_step2-3.png
Binary file modified docs/get-started/xgboost-examples/csp/aws/pics/ec2_step2-4.png
Binary file modified docs/get-started/xgboost-examples/csp/aws/pics/ec2_step2-6.png
Binary file modified docs/get-started/xgboost-examples/csp/aws/pics/ec2_step2-7.png