
Commit ba69f0a

SamyOubouaziz, ldecarvalho-doc, and jcirinosclwy authored
feat(dlb): add v2 doc MTA-6795 (#5920)
* feat(dlb): add v2 doc MTA-6795
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* Apply suggestions from code review
  Co-authored-by: ldecarvalho-doc <82805470+ldecarvalho-doc@users.noreply.github.com>
* Update pages/data-lab/concepts.mdx
  Co-authored-by: ldecarvalho-doc <82805470+ldecarvalho-doc@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Jessica <113192637+jcirinosclwy@users.noreply.github.com>
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update
* feat(dlb): update

---------

Co-authored-by: ldecarvalho-doc <82805470+ldecarvalho-doc@users.noreply.github.com>
Co-authored-by: Jessica <113192637+jcirinosclwy@users.noreply.github.com>
1 parent 3f93cd2 commit ba69f0a

File tree

10 files changed, +380 -148 lines changed


pages/data-lab/concepts.mdx

Lines changed: 21 additions & 10 deletions
@@ -6,13 +6,13 @@ dates:
   validation: 2025-09-02
 ---

-## Apache Spark cluster
+## Apache Spark cluster

-An Apache Spark cluster is an orchestrated set of machines over which distributed/Big data calculus is processed. In the case of Scaleway Data Lab, the Apache Spark cluster is a Kubernetes cluster, with Apache Spark installed in each Pod. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html).
+An Apache Spark cluster is an orchestrated set of machines over which distributed/Big data calculus is processed. In the case of Scaleway Data Lab, the Apache Spark cluster is a Kubernetes cluster, with Apache Spark installed in each Pod. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html).

 ## Data Lab

-A Data Lab is a project setup that combines a Notebook and an Apache Spark Cluster for data analysis and experimentation. it comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.
+A Data Lab is a project setup that combines a Notebook and an Apache Spark™ cluster for data analysis and experimentation. It includes the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.

 ## Data Lab for Apache Spark™

@@ -24,33 +24,44 @@ A fixture is a set of data forming a request used for testing purposes.

 ## GPU

-GPUs (Graphical Processing Units) allow Apache Spark to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and specific data-analytics, significantly reducing the processing time for massive datasets and preparation for AI models.
+GPUs (Graphical Processing Units) allow Apache Spark to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and specific data-analytics, significantly reducing the processing time for massive datasets and preparation for AI models.

 ## JupyterLab

 JupyterLab is a web-based platform for interactive computing, letting you work with notebooks, code, and data all in one place. It builds on the classic Jupyter Notebook by offering a more flexible and integrated user interface, making it easier to handle various file formats and interactive components.

 ## Lighter

-Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter).
+Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark™ cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter).
+
+## Main node
+
+The main node in an Apache Spark™ cluster is the driver node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster.
+

 ## Notebook

-A notebook for an Apache Spark cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.
+A notebook for an Apache Spark™ cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark™ cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.
+
+Adding a notebook to your cluster requires 1 GB of storage.

 ## Persistent volume

 A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual Pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions.

-Apache Spark® executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers.
+Apache Spark executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers.

-A PV sized properly ensures a smooth execution of your workload.
+A persistent volume sized properly ensures a smooth execution of your workload.

 ## SparkMagic

-SparkMagic is a set of tools that allows you to interact with Apache Spark clusters through Jupyter notebooks. It provides magic commands for running Spark jobs, querying data, and managing Spark sessions directly within the notebook interface, facilitating seamless integration and execution of Spark tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic).
+SparkMagic is a set of tools that allows you to interact with Apache Spark clusters through Jupyter notebooks. It provides magic commands for running Spark jobs, querying data, and managing Spark sessions directly within the notebook interface, facilitating seamless integration and execution of Spark tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic).


 ## Transaction

-An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error.
+An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error.
+
+## Worker nodes
+
+Worker nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs, and substantial RAM.
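The persistent volume entry added above is easier to size with an example in mind. Below is a minimal PySpark sketch of a wide operation (a `groupBy` aggregation) that forces a shuffle; the dataset size and column names are illustrative only, not part of the documented product.

```python
# Minimal sketch: a wide transformation (groupBy) that triggers a shuffle.
# The shuffle files written by the executors are what the persistent volume
# has to accommodate on a Data Lab cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-sizing-sketch").getOrCreate()

# Illustrative dataset: 10 million rows spread across the cluster's partitions.
df = spark.range(10_000_000).withColumn("bucket", F.col("id") % 1000)

# groupBy is a wide operation: rows sharing a "bucket" value must be combined
# across partitions, so executors first write shuffle data to storage.
summary = df.groupBy("bucket").agg(
    F.count("*").alias("rows"),
    F.sum("id").alias("total"),
)
summary.show(5)

spark.stop()
```

Sizing the persistent volume against the shuffle data produced by operations like this is what the "sized properly" advice above refers to.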

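Similarly, the transaction entry can be illustrated with a short, self-contained Python sketch using the standard-library `sqlite3` module; any ACID-compliant SQL engine behaves the same way. The first unit of work commits; the second violates a constraint and is rolled back, so none of its changes take effect.

```python
# Minimal sketch of an SQL transaction: commit on success, roll back on error.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # One unit of work: move 30 from alice to bob, committed atomically.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()

    # A second unit of work that violates the CHECK constraint and fails.
    conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'bob'")
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # the failed transaction leaves no partial changes behind

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 70, 'bob': 80}
conn.close()
```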
pages/data-lab/faq.mdx

Lines changed: 12 additions & 16 deletions
@@ -10,11 +10,11 @@ productIcon: DistributedDataLabProductIcon

 ### What is Apache Spark?

-Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
+Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.

-### How does Apache Spark work?
+### How does Apache Spark work?

-Apache Spark processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data.
+Apache Spark processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data.

 ### What workloads is Data Lab for Apache Spark™ suited for?

@@ -24,39 +24,35 @@ Data Lab for Apache Spark™ supports a range of workloads, including:
 - Machine learning tasks
 - High-speed operations on large datasets

-It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark library support.
+It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark library support.

 ## Offering and availability

-### What data source options are available?
-
-Data Lab natively integrates with Scaleway Object Storage for reading and writing data, making it easy to process data directly from your buckets. Your buckets are accessible using the Scaleway console or any other Amazon S3-compatible CLI tool.
-
 ### What notebook is included with Dedicated Data Labs?

-The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark cluster for seamless data processing and calculations.
+The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark cluster for seamless data processing and calculations.

 ## Pricing and billing

 ### How am I billed for Data Lab for Apache Spark™?

-Data Lab for Apache Spark™ is billed based on two factors:
-- The main node configuration selected
+Data Lab for Apache Spark™ is billed based on the following factors:
+- The main node configuration selected.
 - The worker node configuration selected, and the number of worker nodes in the cluster.
+- The persistent volume size provisioned.
+- The presence of a notebook.

 ## Compatibility and integration

 ### Can I run a Data Lab for Apache Spark™ using GPUs?

 Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages Nvidia's [RAPIDS Accelerator For Apache Spark](https://www.nvidia.com/en-gb/deep-learning-ai/software/rapids/), an open-source suite of software libraries and APIs to execute end-to-end data science and analytics pipelines entirely on GPUs. This technology allows for significant acceleration of data processing tasks compared to CPU-based processing.

-### Can I connect to S3 buckets from other cloud providers?
-
-Currently, connections are limited to Scaleway's Object Storage environment.
+### Can I connect a separate notebook environment to the Data Lab?

-### Can I connect my local JupyterLab to the Data Lab?
+Yes, you can connect a different notebook via Private Networks.

-Remote connections to a Data Lab cluster are currently not supported.
+Refer to the [dedicated documentation](/data-lab/how-to/use-private-networks/) for comprehensive information on how to connect to a Data Lab for Apache Spark™ cluster over Private Networks.

 ## Usage and management
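To make the RDD answer above concrete, here is a minimal PySpark sketch (values are illustrative): the data is split into partitions, transformed in parallel, and cached in memory so later actions avoid recomputation.

```python
# Minimal RDD sketch: data is split into partitions, processed in parallel,
# and cached in memory so repeated actions do not recompute it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000), numSlices=8)  # 8 partitions across the cluster
squares = numbers.map(lambda x: x * x).cache()           # kept in memory after first use

print(squares.sum())    # first action: computes the RDD and caches it
print(squares.take(5))  # second action: served from memory

spark.stop()
```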

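For the GPU question, Data Lab GPU clusters already leverage the RAPIDS Accelerator per the answer above. As a rough sketch only, in a generic, self-managed PySpark session the plugin is enabled through Spark configuration properties taken from the RAPIDS documentation; this assumes the RAPIDS jar is available on the cluster classpath and is not a description of the Data Lab setup itself.

```python
# Rough sketch: enabling the RAPIDS Accelerator in a generic PySpark session.
# Assumes the RAPIDS Accelerator jar is on the classpath; Data Lab GPU clusters
# handle this configuration for you.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-sketch")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # RAPIDS Accelerator entry point
    .config("spark.rapids.sql.enabled", "true")             # route supported operators to the GPU
    .getOrCreate()
)

# Supported DataFrame/SQL operations then run on the GPU transparently.
spark.range(1_000_000).selectExpr("sum(id)").show()
spark.stop()
```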
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+---
+title: How to access and use the notebook of a Data Lab cluster
+description: Step-by-step guide to access and use the notebook environment in a Data Lab for Apache Spark™ on Scaleway.
+tags: data lab apache spark notebook environment jupyterlab
+dates:
+  validation: 2025-12-04
+  posted: 2025-12-04
+---
+
+import Requirements from '@macros/iam/requirements.mdx'
+
+This page explains how to access and use the notebook environment of your Data Lab for Apache Spark™ cluster.
+
+<Requirements />
+
+- A Scaleway account logged into the [console](https://console.scaleway.com)
+- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
+- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/) with a notebook
+- Created an [IAM API key](/iam/how-to/create-api-keys/)
+
+## How to access the notebook of your cluster
+
+1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.
+
+2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays.
+
+3. Click the **Open notebook** button. A login page displays.
+
+4. Enter the **secret key** of your API key, then click **Authenticate**. The notebook dashboard displays.
+
+You are now connected to your notebook environment.
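Once authenticated, a quick way to confirm that the notebook can reach the cluster is to run a small job from a cell. The sketch below assumes a PySpark-capable kernel with a `spark` session already bound (setup details depend on the notebook image and its SparkMagic configuration), so adapt it to the kernels actually offered.

```python
# Quick sanity check from a notebook cell, assuming a `spark` session is already
# available in the kernel (an assumption; setup depends on the notebook image).
df = spark.range(1_000)  # small distributed dataset
print(df.selectExpr("count(*) AS rows", "sum(id) AS total").collect())
```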
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+---
+title: How to access the Apache Spark™ UI
+description: Step-by-step guide to access and use the Apache Spark™ UI in a Data Lab for Apache Spark™ on Scaleway.
+tags: data lab apache spark ui gui console
+dates:
+  validation: 2025-12-04
+  posted: 2025-12-04
+---
+
+import Requirements from '@macros/iam/requirements.mdx'
+
+This page explains how to access the Apache Spark™ UI of your Data Lab for Apache Spark™ cluster.
+
+<Requirements />
+
+- A Scaleway account logged into the [console](https://console.scaleway.com)
+- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
+- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/)
+- Created an [IAM API key](/iam/how-to/create-api-keys/)
+
+1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.
+
+2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays.
+
+3. Click the **Open Apache Spark™ UI** button. A login page displays.
+
+4. Enter the **secret key** of your API key, then click **Authenticate**. The Apache Spark™ UI dashboard displays.
+
+From this page, you can view and monitor worker nodes, executors, and applications.
+
+Refer to the [official Apache Spark™ documentation](https://spark.apache.org/docs/latest/web-ui.html) for comprehensive information on how to use the web UI.

pages/data-lab/how-to/connect-to-data-lab.mdx

Lines changed: 0 additions & 38 deletions
This file was deleted.

pages/data-lab/how-to/create-data-lab.mdx

Lines changed: 26 additions & 19 deletions
@@ -3,37 +3,44 @@ title: How to create a Data Lab for Apache Spark™
 description: Step-by-step guide to creating a Data Lab for Apache Spark™ on Scaleway.
 tags: data lab apache spark create process
 dates:
-  validation: 2025-09-02
+  validation: 2025-12-10
   posted: 2024-07-31
 ---
 import Requirements from '@macros/iam/requirements.mdx'

-Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure.
+Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure.

 <Requirements />

 - A Scaleway account logged into the [console](https://console.scaleway.com)
 - [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
-- Optionally, an [Object Storage bucket](/object-storage/how-to/create-a-bucket/)
 - A valid [API key](/iam/how-to/create-api-keys/)
+- Created a [Private Network](/vpc/how-to/create-private-network/)

 1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.

 2. Click **Create Data Lab cluster**. The creation wizard displays.

-3. Complete the following steps in the wizard:
-   - Choose an Apache Spark version from the drop-down menu.
-   - Select a worker node configuration.
-   - Enter the desired number of worker nodes.
-     <Message type="note">
-     Provisioning zero worker nodes lets you retain and access you cluster and notebook configurations, but will not allow you to run calculations.
-     </Message>
-   - Activate the [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs.
-     <Message type="note">
-     Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook.
-     </Message>
-   - Enter a name for your Data Lab.
-   - Optionally, add a description and/or tags for your Data Lab.
-   - Verify the estimated cost.
-
-4. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page.
+3. Choose an Apache Spark™ version from the drop-down menu.
+
+4. Choose a main node type. If you plan to add a notebook to your cluster, select the **DDL-PLAY2-MICRO** configuration to provision sufficient resources for it.
+
+5. Choose a worker node type depending on your hardware requirements.
+
+6. Enter the desired number of worker nodes.
+
+7. Add a [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs.
+
+   <Message type="note">
+   Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook.
+   </Message>
+
+8. Add a notebook if you want to use an integrated notebook environment to interact with your cluster. Adding a notebook requires 1 GB of billable storage.
+
+9. Select a Private Network from the drop-down menu to attach to your cluster, or create a new one. Data Lab clusters cannot be used without a Private Network.
+
+10. Enter a name for your Data Lab cluster, and add an optional description and/or tags.
+
+11. Verify the estimated cost.
+
+12. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page.

0 commit comments
