Submitting Spark applications tends to be highly environment-dependent: the application is often packaged as a JAR without all of the dependencies it needs, with the expectation that the class path provided by `spark-submit` supplies the rest. The "rest" typically includes the Spark and Hadoop JARs, but it might also include other libraries such as Delta Lake. Beyond dependency availability, dependency versioning can also be problematic, for example mismatches between Spark and Hadoop versions. Providing a stable environment via a Docker image can eliminate many of these problems, which can be convenient in certain situations.
This is an example of how you can ship a basic dockerized Spark application with Docker as the only prerequisite. The example application simply writes some CSV data partitions to `/var/data/data.csv`.
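For a concrete picture of what such an application can look like, here is a minimal sketch; the object name, column names, and row values are illustrative assumptions rather than the exact source shipped in this repository.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the example application (names and data are assumptions).
object DockerizedSparkSubmitExample {
  def main(args: Array[String]): Unit = {
    // The master URL is supplied by spark-submit, so none is hard-coded here.
    val spark = SparkSession.builder()
      .appName("dockerized-spark-submit-example")
      .getOrCreate()

    import spark.implicits._

    // Write a handful of rows as CSV; Spark produces one part file per
    // partition under the /var/data/data.csv directory.
    Seq(("a", 1), ("b", 2), ("c", 3))
      .toDF("key", "value")
      .repartition(3)
      .write
      .mode("overwrite")
      .csv("/var/data/data.csv")

    spark.stop()
  }
}
```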
To build the image:

`docker build -t dockerized-spark-submit-example .`

This command is provided as `build.sh` for convenience. The Dockerfile is set up in such a way that this will just run.
Assuming you have built and tagged the image as above, you can run it with:

`docker run dockerized-spark-submit-example:latest`
However, this writes the data inside the container. To write the data to the host instead, we can create a directory such as `$PWD/data` and bind-mount it at `/var/data`. Here's an example of `docker run` with the bind-mount:

`docker run -v $PWD/data:/var/data dockerized-spark-submit-example:latest`
See `run.sh` for a fuller run that supplies a complete, user-provided `spark-submit` command with arguments.
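Such a run might look roughly like the sketch below; the Spark master, main class, and JAR path are placeholders here, so treat `run.sh` as the source of truth.

```sh
#!/usr/bin/env sh
# Sketch only: bind-mount the output directory, publish the Spark UI port,
# and override the image's default CMD with our own spark-submit invocation
# (the ENTRYPOINT is tini, so the arguments after the image name replace CMD).
mkdir -p "$PWD/data"

docker run \
  -v "$PWD/data:/var/data" \
  -p 4040:4040 \
  dockerized-spark-submit-example:latest \
  /opt/spark/bin/spark-submit \
    --master "local[*]" \
    --class DockerizedSparkSubmitExample \
    /opt/app/app.jar  # placeholder path for the application JAR
```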
We want to provide a container image that has:
- Spark (libraries and executables)
- Java
- Our application
The Dockerfile is a two-stage build, where the first stage provides us with an area to build our application; this keeps the final image small. A sketch of a Dockerfile along these lines is shown after the list below.
- The first stage:
  - We use a base image that will support packaging our JAR with `sbt`.
  - We download `wget` so we can download a Spark distribution into the image.
  - We download the Spark distribution and unarchive it into `/opt/spark`.
  - We run `sbt package` to package our app as a thin JAR, which will write into `target/scala-2.12/<artifact-name>.jar` as per our `build.sbt` file. If we had more dependencies besides just Spark, we might package up as an uber JAR with `sbt assembly` for convenience.
- The second stage:
  - At this point we have Spark available and our application packaged, so we just set up the final image by copying it all over onto a basic Java image.
  - We install `tini` as a general best practice for signal handling.
  - We set up some environment variables for convenience when running our app.
  - We expose port 4040 for the Spark UI.
  - We copy over the Spark distribution and the app JAR.
  - We set our entrypoint to `tini` to provide a base for running your own variation of `spark-submit` if needed.
  - We set the default `CMD` to run `spark-submit`, referencing our app JAR and main class, as well as the `MASTER` provided as an environment variable that the user can configure if desired. The user can then either override `MASTER` or provide their own entire `spark-submit` command.
    - The reason we use shell form (`sh -c`) here is so we can get shell variable expansion for the `MASTER` environment variable. In exec form, Docker won't use a shell to execute `CMD`, so variable expansion won't work. See the Docker documentation on shell and exec form.
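Pulling those steps together, a Dockerfile in this shape might look roughly like the following. The base images, version numbers, download URLs, and paths are illustrative assumptions, not necessarily what this repository's Dockerfile actually uses.

```dockerfile
# Sketch of a two-stage build along the lines described above (versions,
# base-image tags, and paths are assumptions).

# --- Stage 1: build area with sbt, plus a downloaded Spark distribution ---
# Any base image providing a JDK, sbt, and Scala 2.12 will do; the tag here is illustrative.
FROM sbtscala/scala-sbt:eclipse-temurin-17.0.4_1.7.1_2.12.16 AS build

ARG SPARK_VERSION=3.5.1

# Fetch wget, then download and unarchive a Spark distribution into /opt/spark.
RUN apt-get update && apt-get install -y wget && \
    wget -q "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz" && \
    tar -xzf "spark-${SPARK_VERSION}-bin-hadoop3.tgz" && \
    mv "spark-${SPARK_VERSION}-bin-hadoop3" /opt/spark

# Package the application as a thin JAR (target/scala-2.12/<artifact-name>.jar).
WORKDIR /app
COPY build.sbt .
COPY src ./src
RUN sbt package

# --- Stage 2: runtime image with Java, Spark, tini, and the app JAR ---
FROM eclipse-temurin:17-jre

# tini for signal handling / zombie reaping (pinned release is an assumption).
ADD https://github.com/krallin/tini/releases/download/v0.19.0/tini /usr/bin/tini
RUN chmod +x /usr/bin/tini

# Convenience environment variables; MASTER can be overridden at run time.
ENV SPARK_HOME=/opt/spark \
    PATH="/opt/spark/bin:${PATH}" \
    MASTER="local[*]"

# Spark UI
EXPOSE 4040

COPY --from=build /opt/spark /opt/spark
COPY --from=build /app/target/scala-2.12/*.jar /opt/app/app.jar

# tini as the entrypoint so users can supply their own spark-submit command.
ENTRYPOINT ["/usr/bin/tini", "--"]

# Shell form so ${MASTER} is expanded at run time; the main class is a placeholder.
CMD spark-submit --master "${MASTER}" --class DockerizedSparkSubmitExample /opt/app/app.jar
```

With an image built from something like this, `docker run -e MASTER=... ...` overrides the master, while appending a full `spark-submit` command after the image name replaces the default `CMD` entirely.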