pages/data-lab/concepts.mdx
dates:
  validation: 2025-09-02
---
## Apache Spark™ cluster
An Apache Spark™ cluster is an orchestrated set of machines across which distributed big data computations are processed. In the case of Scaleway Data Lab, the Apache Spark™ cluster is a Kubernetes cluster, with Apache Spark™ installed in each Pod. For more details, check out the [Apache Spark™ documentation](https://spark.apache.org/documentation.html).
## Data Lab
A Data Lab is a project setup that combines a Notebook and an Apache Spark™ cluster for data analysis and experimentation. It includes the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.
## Data Lab for Apache Spark™
## Fixture

A fixture is a set of data forming a request used for testing purposes.
## GPU
GPUs (Graphics Processing Units) allow Apache Spark™ to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and certain data analytics workloads, significantly reducing the time needed to process massive datasets and prepare data for AI models.
## JupyterLab
JupyterLab is a web-based platform for interactive computing, letting you work with notebooks, code, and data all in one place. It builds on the classic Jupyter Notebook by offering a more flexible and integrated user interface, making it easier to handle various file formats and interactive components.
## Lighter
Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark™ cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter).
## Main node
The main node in an Apache Spark™ cluster is the driver node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster.
## Notebook
A notebook for an Apache Spark™ cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark™ cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.
Adding a notebook to your cluster requires 1 GB of storage.
## Persistent volume
A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual Pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions.
Apache Spark™ executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers.
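The following minimal sketch shows the kind of wide operation that triggers a shuffle. It assumes a PySpark `SparkSession` named `spark` is already available, as it is in a Data Lab notebook session, and the column names are illustrative only.

```python
from pyspark.sql import functions as F

# Assumes an existing SparkSession named `spark` (provided by the notebook session).
df = spark.range(10_000_000).withColumn("bucket", F.col("id") % 100)

# groupBy/agg is a wide transformation: rows sharing a bucket must be brought together,
# so executors write intermediate shuffle files to storage before reducers read them back.
result = df.groupBy("bucket").agg(
    F.count("*").alias("rows"),
    F.avg("id").alias("avg_id"),
)
result.show(5)
```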
A properly sized persistent volume ensures smooth execution of your workload.
## SparkMagic
SparkMagic is a set of tools that allows you to interact with Apache Spark™ clusters through Jupyter notebooks. It provides magic commands for running Spark™ jobs, querying data, and managing Spark™ sessions directly within the notebook interface, facilitating seamless integration and execution of Spark™ tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic).
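As an illustrative sketch of how SparkMagic is typically used from a Jupyter notebook, the cells below load the extension, register a session, and run PySpark code remotely. The Lighter endpoint URL and session name are placeholders, and a Data Lab notebook may ship with this connection already configured.

```python
# Each block below represents a separate notebook cell.

# Cell 1 - load the SparkMagic IPython extension:
%load_ext sparkmagic.magics

# Cell 2 - register a remote session against the cluster's Lighter endpoint
# (URL and session name are placeholders):
%spark add -s demo -l python -u http://lighter.example.internal/lighter/api

# Cell 3 - run PySpark code remotely on the Apache Spark cluster:
%%spark
df = spark.range(1_000_000)
print(df.selectExpr("sum(id)").collect())
```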
## Transaction
An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error.
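As a generic illustration of these properties, not specific to Data Lab, the sketch below uses Python's built-in `sqlite3` module: both updates inside the transaction either commit together or roll back together.

```python
import sqlite3

# Generic illustration of transactional (ACID) behavior using the standard library.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # opens a transaction: commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # if anything failed, neither UPDATE took effect (atomicity)

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]
```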
## Worker nodes
Worker nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs and substantial RAM.
Apache Spark™ is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark™ offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
### How does Apache Spark™ work?
Apache Spark™ processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data.
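As a minimal sketch, assuming a `SparkSession` named `spark` is already available (as in a Data Lab notebook), parallel operations on an RDD look like this:

```python
# The SparkContext distributes the data across the cluster's partitions.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# Transformations such as `map` are lazy and run in parallel on each partition.
squares = rdd.map(lambda x: x * x)

# An action such as `reduce` triggers the distributed computation and returns a result.
total = squares.reduce(lambda a, b: a + b)
print(total)
```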
### What workloads is Data Lab for Apache Spark™ suited for?
Data Lab for Apache Spark™ supports a range of workloads, including:
- Machine learning tasks
- High-speed operations on large datasets
It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark™ library support.
## Offering and availability
### What notebook is included with Dedicated Data Labs?
The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark™ cluster for seamless data processing and calculations.
## Pricing and billing
### How am I billed for Data Lab for Apache Spark™?
Data Lab for Apache Spark™ is billed based on the following factors:
- The main node configuration selected.
- The worker node configuration selected, and the number of worker nodes in the cluster.
- The persistent volume size provisioned.
- The presence of a notebook.
## Compatibility and integration
### Can I run a Data Lab for Apache Spark™ using GPUs?
Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages Nvidia's [RAPIDS Accelerator For Apache Spark](https://www.nvidia.com/en-gb/deep-learning-ai/software/rapids/), an open-source suite of software libraries and APIs to execute end-to-end data science and analytics pipelines entirely on GPUs. This technology allows for significant acceleration of data processing tasks compared to CPU-based processing.
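As an illustrative sketch only, GPU acceleration with the RAPIDS Accelerator is typically enabled through Spark configuration such as the following; on a managed Data Lab cluster, equivalent settings may already be applied for you, and the application name is a placeholder.

```python
from pyspark.sql import SparkSession

# Illustrative RAPIDS Accelerator configuration (values are examples, not Data Lab defaults).
spark = (
    SparkSession.builder
    .appName("gpu-accelerated-job")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # load the RAPIDS SQL plugin
    .config("spark.rapids.sql.enabled", "true")             # offload supported operators to GPUs
    .getOrCreate()
)

# Supported DataFrame/SQL operations in this session can now run on the GPU.
spark.range(1_000_000).selectExpr("avg(id)").show()
```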
### Can I connect a separate notebook environment to the Data Lab?
Yes, you can connect a different notebook via Private Networks.
Refer to the [dedicated documentation](/data-lab/how-to/use-private-networks/) for comprehensive information on how to connect to a Data Lab for Apache Spark™ cluster over Private Networks.
This page explains how to access the Apache Spark™ UI of your Data Lab for Apache Spark™ cluster.
<Requirements />
- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/)
- Created an [IAM API key](/iam/how-to/create-api-keys/)
1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.
2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays.
3. Click the **Open Apache Spark™ UI** button. A login page displays.
4. Enter the **secret key** of your API key, then click **Authenticate**. The Apache Spark™ UI dashboard displays.
From this page, you can view and monitor worker nodes, executors, and applications.
Refer to the [official Apache Spark™ documentation](https://spark.apache.org/docs/latest/web-ui.html) for comprehensive information on how to use the web UI.
Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark™ infrastructure.
<Requirements />
- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- A valid [API key](/iam/how-to/create-api-keys/)
- Created a [Private Network](/vpc/how-to/create-private-network/)
1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.
2. Click **Create Data Lab cluster**. The creation wizard displays.
3. Choose an Apache Spark™ version from the drop-down menu.
4. Choose a main node type. If you plan to add a notebook to your cluster, select the **DDL-PLAY2-MICRO** configuration to provision sufficient resources for it.
5. Choose a worker node type depending on your hardware requirements.
6. Enter the desired number of worker nodes.
7. Add a [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs.
    <Message type="note">
      Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook.
    </Message>
8. Add a notebook if you want to use an integrated notebook environment to interact with your cluster. Adding a notebook requires 1 GB of billable storage.
9. Select a Private Network from the drop-down menu to attach to your cluster, or create a new one. Data Lab clusters cannot be used without a Private Network.
10. Enter a name for your Data Lab cluster, and add an optional description and/or tags.
11. Verify the estimated cost.
12. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page.