From 52dff12da74b3383c13d5d1b2b9c869872d926df Mon Sep 17 00:00:00 2001 From: JohnZed <904524+JohnZed@users.noreply.github.com> Date: Thu, 9 May 2019 17:43:40 -0700 Subject: [PATCH 1/2] Comments on Readme This looks super-helpful, Matt! I hope you don't mind my pedantic comments... I think installation instructions are the most important (ideally aligned with other RAPIDS installation). Is there a docker container to try this out? Would be great to mention that too. Otherwise, adding a bit more big-picture context around data movement and placement would be the other thing to think about. --- README.md | 59 +++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 44 insertions(+), 15 deletions(-) diff --git a/README.md b/README.md index f54e6fa..63881eb 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,14 @@ # RAPIDS, Dask, and XGBoost +## _It would be nice to add a little more high-level/conceptual docs at the beginning... I'm happy to help out with that tomorrow if you don't mind taking a look to review..._ + +## Also it would be great to have installation instructions and GPU/system requirements (Pascal+? Tested on linux-only?) + ## Dask -[Dask](https://dask.org "Dask: Scalable Analytics in Python") is a framework for flexibly scaling computation in Python. +[Dask](https://dask.org "Dask: Scalable Analytics in Python") is a framework for flexibly scaling computational graphs in Python on a single node or across a cluster. -It implements a distributed Pandas DataFrame to elastically scale over many workers. +It implements several distributed data structures, most importantly a distributed Pandas-style DataFrame that elastically scales over many workers. ```python import dask.dataframe as dd @@ -22,9 +27,9 @@ df2 = df[df.y == 'a'].x + 1 ``` ## cuDF -[RAPIDS](https://rapids.ai "RAPIDS: Open GPU Data Science") is a collection of open source software libraries aimed at accelerating data science applications end-to-end.
+[RAPIDS](https://rapids.ai "RAPIDS: Open GPU Data Science") is a collection of open source software libraries aimed at accelerating data science applications end-to-end with GPUs. -Much like Pandas, users can read in data, and perform various analytical tasks in parity with Pandas. +Using a Pandas-compatible API, users can read in data and perform various analytical tasks directly on the GPU. ```python import cudf @@ -42,8 +47,15 @@ The [RAPIDS Fork of XGBoost](https://github.com/rapidsai/xgboost "RAPIDS XGBoost") enables XGBoost with cuDF: a user may directly pass a cuDF object into XGBoost for training, prediction, etc. ## Dask-XGBoost The [RAPIDS Fork of Dask-XGBoost](https://github.com/rapidsai/dask-xgboost/ "RAPIDS Dask-XGBoost") enables XGBoost with the distributed CUDA DataFrame via Dask-cuDF. A user may pass Dask-XGBoost a reference to a distributed cuDF object, and start a training session over an entire cluster from Python. +## _Would be great to have a high-level set of steps, like:_ +## Training an XGboost model across multiple GPUs with Dask-XGboost just requires you to: +## (1) Create a CUDA-aware Dask cluster, with one worker per GPU +## (2) Load data into a Dask-CuDF dataframe, distributed across GPUs +## (3) Call XGboost's distributed training functions + + # Using Dask-cuDF -A user may instantiate a Dask-cuDF cluster like this: +A user may instantiate a single-node Dask-cuDF cluster like this: ```python import cudf import dask_cudf import xgboost as xgb from dask.distributed import Client from dask_cuda import LocalCUDACluster import subprocess +## XXX: Can we do socket.gethostbyname(socket.gethostname()) or something like that? Seems a little distracting +# First, find our host's IP address cmd = "hostname --all-ip-addresses" process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE) output, error = process.communicate() IPADDR = str(output.decode()).split()[0] +# Now launch a Dask-CUDA cluster on the local host cluster = LocalCUDACluster(ip=IPADDR) client = Client(cluster) client @@ -68,24 +83,32 @@ client Note the use of `from dask_cuda import LocalCUDACluster`.
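The `XXX` review comment above floats a standard-library alternative to shelling out for the host IP. A minimal sketch of that idea (an illustration of the suggestion, not the README's tested workflow; note that resolving the local hostname can yield the loopback address depending on `/etc/hosts`):

```python
import socket

# Possible stdlib alternative to running `hostname --all-ip-addresses`,
# per the review suggestion above. Caveat: depending on /etc/hosts this
# may resolve to 127.0.0.1 rather than an external interface address.
try:
    IPADDR = socket.gethostbyname(socket.gethostname())
except socket.gaierror:
    # Fall back to loopback if the hostname is not resolvable
    IPADDR = "127.0.0.1"

print(IPADDR)
```

If the loopback address is returned, the subprocess approach shown above remains the more reliable option.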
[Dask-CUDA](https://github.com/rapidsai/dask-cuda) is a lightweight set of utilities useful for setting up a Dask cluster. These calls instantiate a Dask-cuDF cluster in a single node environment. To instantiate a multi-node Dask-cuDF cluster, a user must use `dask-scheduler` and `dask-cuda-worker`. +_Are there docs on multi-nodes setup? Maybe in an advanced section?) + Once a `client` is available, a user may commission the client to perform tasks: ```python -def foo(): - return +def do_computation(): + return 1 -client.run(foo) +client.run(do_computation) ``` -Alternatively, a user may employ `dask_cudf`: +Alternatively, a user may employ `dask_cudf` to distribute both data and computation over the workers: + +## _TODO: Not necessarily in this chunk, but I think one of the big missing pieces to me is understanding how +## dask_cudf distributes data across the workers / nodes / GPUs. I think this will be a common question for +## customers too_ + +## _Would it be better to show an example that uses dask to do the data loading?_ ```python pdf = pd.DataFrame( - {"x": np.random.randint(0, 5, size=10000), "y": np.random.normal(size=10000)}) - -gdf = cudf.DataFrame.from_pandas(pdf) + {"x": np.random.randint(0, 5, size=10000), + "y": np.random.normal(size=10000)}) # Construct Pandas structure in CPU memory -ddf = dask_cudf.from_cudf(gdf, npartitions=5) +gdf = cudf.DataFrame.from_pandas(pdf) # CuDF transfers the data to local GPU memory (??) +ddf = dask_cudf.from_cudf(gdf, npartitions=5) # Dask-CuDF distributes the data across all GPU workers (??) 
ddf = ddf.groupby("x").mean() ``` @@ -98,6 +121,8 @@ There are two functions of interest with Dask-XGBoost: The documentation for `dask_xgboost.train` is this: +## _As above, I think that talking about how data gets distributed would be helpful_ + ```python help(dask_xgboost.train) @@ -140,7 +165,7 @@ params = { 'num_rounds': 100, 'max_depth': 8, 'max_leaves': 2**8, - 'n_gpus': 1, + 'n_gpus': 1, # This will set the number of GPUs per worker 'tree_method': 'gpu_hist', 'objective': 'reg:squarederror', 'grow_policy': 'lossguide' } @@ -203,4 +228,8 @@ rmse = np.sqrt(test.squared_error.mean().compute()) 2. `bst`: the Booster produced by the XGBoost training session 3. `x_test`: an instance of `dask_cudf.DataFrame` containing the data on which to run inference (acquire predictions) -`pred` will be an instance of `dask_cudf.Series`. We can use `dask.dataframe.multi.concat` to construct a `dask_cudf.DataFrame` by concatenating the list of `dask_cudf.Series` instances (`[pred]`). `test` is a `dask_cudf.DataFrame` object with a single column named `0` (e.g.) `test[0]` returns `pred`. Additionally, the root-mean-squared-error (RMSE) can be computed by constructing a new column and assigning to it the value of the difference between predicted and labeled values squared. This is encoded in the assignment `test['squared_error'] = (test[0] - y_test['x'])**2`. Finally, the mean can be computed by using an aggregator from the `dask_cudf` API. The entire computation is initiated via `.compute()`; finally, we take the square-root of the result, leaving us with `rmse = np.sqrt(test.squared_error.mean().compute())`. Note: `.squared_error` is an accessor for `test[squared_error]`. \ No newline at end of file +`pred` will be an instance of `dask_cudf.Series`. We can use `dask.dataframe.multi.concat` to construct a `dask_cudf.DataFrame` by concatenating the list of `dask_cudf.Series` instances (`[pred]`). `test` is a `dask_cudf.DataFrame` object with a single column named `0`; e.g.,
`test[0]` returns `pred`. + +# This is a bit confusing... maybe just needs a link to dask docs explaining the use of compute? Wouldn't be bad to mention # lazy-computation here +Additionally, the root-mean-squared-error (RMSE) can be computed by constructing a new column and assigning to it the squared difference between the predicted and labeled values. This is encoded in the assignment `test['squared_error'] = (test[0] - y_test['x'])**2`. Finally, the mean can be computed by using an aggregator from the `dask_cudf` API. The entire computation is initiated via `.compute()`; we then take the square root of the result, leaving us with `rmse = np.sqrt(test.squared_error.mean().compute())`. Note: `.squared_error` is an accessor for `test['squared_error']`. From 55a65b388e25cb5870bf08e9bd9135cb65f2b994 Mon Sep 17 00:00:00 2001 From: John Zedlewski Date: Sat, 11 May 2019 21:52:52 -0700 Subject: [PATCH 2/2] Add more intro suggestions to docs --- README.md | 57 ++++++++++++++++++++++++++++++++----------------------- 1 file changed, 33 insertions(+), 24 deletions(-) diff --git a/README.md b/README.md index 63881eb..d9d0ee0 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,22 @@ # RAPIDS, Dask, and XGBoost -## _It would be nice to add a little more high-level/conceptual docs at the beginning... I'm happy to help out with that tomorrow if you -don't mind taking a look to review..._ +Dask and XGBoost are two of the most popular open source packages for data +processing and machine learning, respectively. RAPIDS augments them to support +accelerated computation across many GPUs on one or several nodes. -## Also it would be great to have installation instructions and GPU/system requirements (Pascal+? Tested on linux-only?) +RAPIDS integrates cuDF, a GPU-based dataframe implementation, with a CUDA-aware +version of Dask to accelerate data loading and preparation.
These can pass data +seamlessly into XGBoost for model training, while reducing the amount of +unnecessary data copying. -## Dask -[Dask](https://dask.org "Dask: Scalable Analytics in Python") is a framework for flexibly scaling computational graphs in Python on a single node or across a cluster. -It implements several distributed datastructured, most importantly a distributed Pandas-style DataFrame that elastically scales over many workers. +## Dask + +[Dask](https://dask.org "Dask: Scalable Analytics in Python") is a +framework for flexibly scaling computational graphs in Python on a single node or across a cluster. + +It implements several distributed data structures, most importantly a +distributed, Pandas-style DataFrame that elastically scales over many workers. ```python import dask.dataframe as dd @@ -20,8 +28,6 @@ df.head() 1 2 b 2 3 c 3 4 a - 4 5 b - 5 6 c df2 = df[df.y == 'a'].x + 1 ``` @@ -45,17 +51,24 @@ for column in gdf.columns: The [RAPIDS Fork of XGBoost](https://github.com/rapidsai/xgboost "RAPIDS XGBoost") enables XGBoost with cuDF: a user may directly pass a cuDF object into XGBoost for training, prediction, etc. ## Dask-XGBoost -The [RAPIDS Fork of Dask-XGBoost](https://github.com/rapidsai/dask-xgboost/ "RAPIDS Dask-XGBoost") enables XGBoost with the distributed CUDA DataFrame via Dask-cuDF. A user may pass Dask-XGBoost a reference to a distributed cuDF object, and start a training session over an entire cluster from Python. -## _Would be great to have a high-level set of steps, like:_ -## Training an XGboost model across multiple GPUs with Dask-XGboost just requires you to: -## (1) Create a CUDA-aware Dask cluster, with one worker per GPU -## (2) Load data into a Dask-CuDF dataframe, distributed across GPUs -## (3) Call XGboost's distributed training functions +The [RAPIDS Fork of Dask-XGBoost](https://github.com/rapidsai/dask-xgboost/ +"RAPIDS Dask-XGBoost") enables XGBoost with the distributed CUDA DataFrame via +Dask-cuDF.
A user may pass Dask-XGBoost a reference to a distributed cuDF +object, and start a training session over an entire cluster from Python. + +# Building a Dask-cuDF-XGBoost pipeline + +Training an XGBoost model across multiple GPUs with Dask-XGBoost requires 3 main steps: + (1) Create a CUDA-aware Dask instance, with one worker per GPU + (2) Load data into a Dask-cuDF DataFrame, distributed across GPUs + (3) Call XGBoost's distributed training functions + +## (1) Create a CUDA-aware Dask instance -# Using Dask-cuDF -A user may instantiate a single-node Dask-cuDF cluster like this: +Setting up the Dask instance to use all GPUs on a single machine is +straightforward: ```python import cudf @@ -81,11 +94,9 @@ client = Client(cluster) client ``` -Note the use of `from dask_cuda import LocalCUDACluster`. [Dask-CUDA](https://github.com/rapidsai/dask-cuda) is a lightweight set of utilities useful for setting up a Dask cluster. These calls instantiate a Dask-cuDF cluster in a single node environment. To instantiate a multi-node Dask-cuDF cluster, a user must use `dask-scheduler` and `dask-cuda-worker`. +Note the use of `from dask_cuda import LocalCUDACluster`. [Dask-CUDA](https://github.com/rapidsai/dask-cuda) is a lightweight set of utilities useful for setting up a Dask cluster. These calls instantiate a Dask-cuDF cluster in a single node environment. To instantiate a multi-node Dask-cuDF cluster, a user must use `dask-scheduler` and `dask-cuda-worker`. See the [Dask distributed documentation](http://distributed.dask.org/en/latest/setup.html) for more details. -_Are there docs on multi-nodes setup? Maybe in an advanced section?)
- -Once a `client` is available, a user may commission the client to perform tasks: +Once a `client` is available, a user may use the client to run tasks across the workers: ```python def do_computation(): @@ -99,7 +110,7 @@ Alternatively, a user may employ `dask_cudf` to distribute both data and computa ## _TODO: Not necessarily in this chunk, but I think one of the big missing pieces to me is understanding how ## dask_cudf distributes data across the workers / nodes / GPUs. I think this will be a common question for ## customers too_ - +## ## _Would it be better to show an example that uses dask to do the data loading?_ ```python @@ -119,9 +130,7 @@ There are two functions of interest with Dask-XGBoost: 1. `dask_xgboost.train` 2. `dask_xgboost.predict` -The documentation for `dask_xgboost.train` is this: - -## _As above, I think that talking about how data gets distributed would be helpful_ +The documentation for `dask_xgboost.train` provides the clearest explanation: ```python help(dask_xgboost.train)
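# The `help()` text above documents the training entry point. As a hedged,
# commented-out sketch of how the two functions fit together (not runnable
# as-is: `x_train` and `y_train` are hypothetical placeholders for the
# feature and label Dask-cuDF objects; `client`, `params`, `bst`, `x_test`,
# and `pred` follow the names used earlier in this README):
#
#   bst = dask_xgboost.train(client, params, x_train, y_train)
#   pred = dask_xgboost.predict(client, bst, x_test)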