This repository contains the Dockerfile and associated configuration for deploying Apache Spark in standalone mode, both locally with Docker and on Kubernetes.
It reuses the BERDL Docker image as a base, which provides a consistent environment for running Spark jobs and Python applications.
Two environment variable naming conventions are used:

- `SPARK_`-prefixed variables are official Spark environment variables, such as `SPARK_MASTER_URL` and `SPARK_DRIVER_HOST`.
- `BERDL_`-prefixed variables are our specialized ones.
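For illustration, here is a minimal sketch of how a PySpark application inside the cluster might pick these up from the environment. `BERDL_EXAMPLE_SETTING` is a hypothetical name used only to show the naming convention; substitute a real variable from your deployment.

```python
import os

from pyspark.sql import SparkSession

# SPARK_MASTER_URL is one of the official SPARK_-prefixed variables set by
# the cluster configuration (e.g. spark://spark-master:7077).
master_url = os.environ["SPARK_MASTER_URL"]

# BERDL_EXAMPLE_SETTING is a hypothetical BERDL_-prefixed variable, shown
# only to illustrate the convention; substitute a real one from your setup.
example_setting = os.environ.get("BERDL_EXAMPLE_SETTING", "unset")

spark = SparkSession.builder.master(master_url).appName("env-demo").getOrCreate()
print(f"Connected to {spark.sparkContext.master}; BERDL_EXAMPLE_SETTING={example_setting}")
spark.stop()
```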
To build and run the cluster locally:

- Clone the repository:

  ```bash
  git clone git@github.com:kbase/cdm-spark-standalone.git
  cd cdm-spark-standalone
  ```

- Build the Docker image and start the containers:

  ```bash
  docker compose up -d --build
  ```
- Access the Spark UIs:
  - Spark Master: http://localhost:8090
  - Spark Worker 1: http://localhost:8081
  - Spark Worker 2: http://localhost:8082
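If you prefer to verify the UIs from a script rather than a browser, a quick reachability probe might look like this (ports as listed above):

```python
import urllib.request

# Ports as mapped in the compose file above.
endpoints = {
    "Spark Master": "http://localhost:8090",
    "Spark Worker 1": "http://localhost:8081",
    "Spark Worker 2": "http://localhost:8082",
}

for name, url in endpoints.items():
    try:
        # A 200 response means the UI is up and serving.
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")
```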
To test that the cluster is working:

- Start a shell in the spark-user container:

  ```bash
  docker compose exec -it spark-user bash
  ```

- From the Spark home directory (`/opt/bitnami/spark`) inside the container, submit a test job:

  ```bash
  bin/spark-submit --master $SPARK_MASTER_URL --deploy-mode client examples/src/main/python/pi.py 10
  ```

  You should see a line like `Pi is roughly 3.138040` in the output.
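For reference, the bundled `pi.py` example estimates π by Monte Carlo sampling; a stripped-down sketch of the same computation looks roughly like this:

```python
import random
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiSketch").getOrCreate()

partitions = 10
n = 100_000 * partitions

def inside(_):
    # Sample a point in the unit square; it lands inside the quarter
    # circle with probability pi/4.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(add)
print(f"Pi is roughly {4.0 * count / n}")
spark.stop()
```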
To run the Redis caching example:

- Start a shell in the spark-user container:

  ```bash
  docker compose exec -it spark-user bash
  ```

- Run the example (a sketch of what this script presumably does follows these steps):

  ```bash
  spark-submit /app/redis_container_script.py
  ```
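The script itself is not reproduced here, but judging from the `people:*` hash keys inspected in the next section, its core logic is presumably along these lines. The sample data, column names, and the `redis`/`6379` host and port are all assumptions:

```python
import uuid

import redis  # redis-py, assumed to be installed in the spark-user image
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redis-cache-sketch").getOrCreate()

# Hypothetical sample data; the real script's source and schema may differ.
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# "redis" is assumed to be the compose service hostname; 6379 is the
# default Redis port.
client = redis.Redis(host="redis", port=6379, decode_responses=True)

for row in people.collect():
    # Store each row as a Redis hash under people:<random id>, which would
    # produce the people:* keys inspected in the next section.
    client.hset(f"people:{uuid.uuid4().hex}", mapping=row.asDict())

spark.stop()
```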
To inspect the cached data in Redis:

- Start a shell in the Redis container:

  ```bash
  docker compose exec -it redis bash
  ```

- Start the Redis CLI:

  ```bash
  redis-cli
  ```

- List all keys for your cached table:

  ```
  keys people:*
  ```

- View the contents of a specific key (replace the key with one from the previous command):

  ```
  hgetall people:d6d606a747ae40368fc7fdae784b835b
  ```
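Equivalently, you can read the cached hashes back from Python with redis-py; a small sketch, using the same assumed host and port as above:

```python
import redis

client = redis.Redis(host="redis", port=6379, decode_responses=True)

# scan_iter avoids blocking the server the way KEYS can on large keyspaces.
for key in client.scan_iter("people:*"):
    print(key, client.hgetall(key))
```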