
KBase Spark Standalone Deployment

This repository contains the Dockerfile and associated configurations for deploying Apache Spark in standalone mode, using Docker locally and on Kubernetes.

It reuses the BERDL Docker image as a base, which provides a consistent environment for running Spark jobs and Python applications.

ENV VARS

  • Variables with the SPARK_ prefix are official Spark environment variables, such as SPARK_MASTER_URL, SPARK_DRIVER_HOST, etc.
  • Variables with the BERDL_ prefix are specific to this deployment.
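
As a quick illustration, a job can read these variables from its environment. The sketch below is illustrative only; BERDL_EXAMPLE_SETTING is a hypothetical placeholder name, not a variable actually defined by this image.

    import os

    # SPARK_MASTER_URL is a standard Spark variable (e.g. spark://<master-host>:7077).
    master_url = os.environ["SPARK_MASTER_URL"]

    # BERDL_EXAMPLE_SETTING is a hypothetical placeholder for one of the
    # BERDL_-prefixed variables provided by this deployment.
    example_setting = os.environ.get("BERDL_EXAMPLE_SETTING", "default-value")

    print(f"Connecting to Spark master at {master_url}")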

Getting Started

  1. Clone the repository

    git clone git@github.com:kbase/cdm-spark-standalone.git
    cd cdm-spark-standalone
  2. Build and start the containers

    docker compose up -d --build
  3. Access the Spark UI in your browser.

Testing the Spark Cluster

To test that the cluster is working:

  1. Start a shell in the spark-user container:

    docker compose exec -it spark-user bash
  2. Submit a test job:

    spark_user@d30c26e91ae0:/opt/bitnami/spark$ bin/spark-submit --master $SPARK_MASTER_URL --deploy-mode client examples/src/main/python/pi.py 10
    

You should see a line like

Pi is roughly 3.138040

in the output.
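
Beyond the bundled pi.py example, you can submit a minimal PySpark script of your own in the same way. The sketch below is illustrative; the filename smoke_test.py and its location are assumptions, and the master URL is supplied via spark-submit's --master flag just as in the command above.

    # smoke_test.py -- a minimal job to confirm the cluster can schedule work.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cluster-smoke-test")
        .getOrCreate()  # the master URL comes from spark-submit's --master flag
    )

    # Distribute a small range across the workers and sum it.
    total = spark.sparkContext.parallelize(range(1000)).sum()
    print(f"Sum of 0..999 = {total}")  # expected: 499500

    spark.stop()

Submit it with bin/spark-submit --master $SPARK_MASTER_URL --deploy-mode client smoke_test.py.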

Using Redis for Caching

  1. Start a shell in the spark-user container:

    docker compose exec -it spark-user bash
  2. Run the example:

    spark-submit /app/redis_container_script.py
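
The contents of redis_container_script.py are not reproduced here, but a Spark-to-Redis write typically looks like the sketch below. It assumes the spark-redis connector is on the classpath and that the Redis service is reachable as "redis" on port 6379; treat the details as illustrative rather than the script's actual code.

    # Hypothetical sketch: caching a DataFrame in Redis with the spark-redis connector.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("redis-cache-example")
        # Assumed values: the compose file's Redis service name and the default port.
        .config("spark.redis.host", "redis")
        .config("spark.redis.port", "6379")
        .getOrCreate()
    )

    people = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45)],
        ["name", "age"],
    )

    # Each row is stored as a Redis hash under a "people:<id>" key,
    # which is the key pattern the verification steps below search for.
    (people.write
        .format("org.apache.spark.sql.redis")
        .option("table", "people")
        .mode("overwrite")
        .save())

    spark.stop()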

Verifying Cache in Redis

  1. Start a shell in the Redis container:

    docker compose exec -it redis bash
  2. Start the Redis CLI:

    redis-cli
  3. List all keys for your cached table:

    keys people:*
  4. View the contents of a specific key (replace the key with one from the previous command):

    hgetall people:d6d606a747ae40368fc7fdae784b835b
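
As a final check, the cached table can also be read back into Spark. The sketch below is illustrative and reuses the assumptions from the write example above (spark-redis connector available, Redis reachable as "redis" on port 6379).

    # Hypothetical sketch: reading the cached "people" table back from Redis.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("redis-read-example")
        .config("spark.redis.host", "redis")   # assumed service name
        .config("spark.redis.port", "6379")    # assumed default port
        .getOrCreate()
    )

    # Without an explicit schema, field values may come back as strings.
    people = (
        spark.read
        .format("org.apache.spark.sql.redis")
        .option("table", "people")
        .load()
    )

    people.show()
    spark.stop()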
