|
(spark-usage)=
# Getting started with Apache Spark and CrateDB
|
**Apache Spark** is an open-source distributed computing framework designed for high-speed, versatile big-data processing. It offers support for various data processing tasks, such as batch processing, real-time streaming, machine learning, and graph analytics. It is a popular choice for organizations looking to analyze large datasets efficiently.
|
Using Apache Spark with CrateDB is a powerful combination for processing and analyzing large datasets. In this usage guide, we'll walk through the process of setting up PySpark (the Python API for Spark) to work with CrateDB, including data loading, processing, and writing results back to CrateDB.
|
Prerequisites:
|
|
|
## Set up Apache Spark
|
This usage guide works with a single-node Apache Spark installation running on a Mac M1 machine. To set up Apache Spark on your machine, follow these steps:
|
1. Install Java and Scala, as Apache Spark requires both to run:
|
   ```shell
   brew install openjdk@11
   brew install scala
   ```
|
   Before verifying your Java installation, set the `JAVA_HOME` environment variable by adding the following line to your shell profile. On an M1 (Apple Silicon) Mac, Homebrew installs under `/opt/homebrew`; on Intel Macs, it uses `/usr/local`:
|
   `export JAVA_HOME="/opt/homebrew/opt/openjdk@11"`
|
2. Install the latest version of Apache Spark (which includes PySpark):
|
   ```shell
   brew install apache-spark
   ```
|
3. Verify the installation of `apache-spark` and `pyspark`:
|
   ```shell
   spark-shell --version
   pyspark --version
   ```
|
4. Finally, as Spark communicates with CrateDB via JDBC, download the [Postgres JDBC driver](https://jdbc.postgresql.org/download/) into your working directory. In this usage guide, we use the `postgresql-42.6.0.jar` driver.
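
To confirm that Spark can load the driver and reach CrateDB, you can run a short PySpark session like the following. This is a minimal sketch, assuming CrateDB runs locally with its PostgreSQL interface on the default port 5432 and the default `crate` user; it queries CrateDB's built-in `sys.summits` table. The driver jar can also be passed at launch time with `pyspark --jars postgresql-42.6.0.jar`.

```python
from pyspark.sql import SparkSession

# Put the PostgreSQL JDBC driver on the classpath. The relative path assumes
# the jar sits in the working directory, as in step 4 above.
spark = (
    SparkSession.builder
    .appName("cratedb-connectivity-check")
    .config("spark.jars", "postgresql-42.6.0.jar")
    .getOrCreate()
)

# Read a built-in CrateDB system table over JDBC. Host, port, and user
# reflect a default local CrateDB instance; adjust them for your setup.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/doc")
    .option("driver", "org.postgresql.Driver")
    .option("user", "crate")
    .option("dbtable", "sys.summits")
    .load()
)

df.show(5)
```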
|
|
## Data analysis
|