Submitting Spark applications tends to be highly environment-dependent: the application is often packaged as a JAR without all of the dependencies it needs, with the expectation that the class path provided by `spark-submit` supplies the rest. The "rest" typically includes the Spark and Hadoop JARs, but it might also include other libraries such as Delta Lake. Beyond dependency availability, dependency versioning can also be problematic, for example mismatches between Spark and Hadoop versions. Providing a stable environment via a Docker image can eliminate many of these problems, which can be convenient in certain situations.
This is an example of how you can ship a basic dockerized Spark application with Docker as the only prerequisite. The example application simply writes some CSV data partitions to `/var/data/data.csv`.
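For a concrete picture of what such an application can look like, here is a minimal sketch; the object name, column names, and row values are illustrative assumptions rather than the exact source shipped in this repository.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the example application (names and data are assumptions).
object DockerizedSparkSubmitExample {
  def main(args: Array[String]): Unit = {
    // The master URL is supplied by spark-submit, so none is hard-coded here.
    val spark = SparkSession.builder()
      .appName("dockerized-spark-submit-example")
      .getOrCreate()

    import spark.implicits._

    // Write a handful of rows as CSV; Spark produces one part file per
    // partition under the /var/data/data.csv directory.
    Seq(("a", 1), ("b", 2), ("c", 3))
      .toDF("key", "value")
      .repartition(3)
      .write
      .mode("overwrite")
      .csv("/var/data/data.csv")

    spark.stop()
  }
}
```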
To build the image:

`docker build -t dockerized-spark-submit-example .`

This command is provided as `build.sh` for convenience. The Dockerfile is set up in such a way that this will just run.
Assuming you have built and tagged the image as above, you can run it with:

`docker run dockerized-spark-submit-example:latest`
However, this writes the data inside the container. To write the data to the host instead, we can create a directory such as `$PWD/data` and bind-mount it at `/var/data`. Here's an example of `docker run` with the bind-mount:

`docker run -v $PWD/data:/var/data dockerized-spark-submit-example:latest`
See `run.sh` for a fuller run that supplies a complete, user-provided `spark-submit` command with arguments.
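Such a run might look roughly like the sketch below; the Spark master, main class, and JAR path are placeholders here, so treat `run.sh` as the source of truth.

```sh
#!/usr/bin/env sh
# Sketch only: bind-mount the output directory, publish the Spark UI port,
# and override the image's default CMD with our own spark-submit invocation
# (the ENTRYPOINT is tini, so the arguments after the image name replace CMD).
mkdir -p "$PWD/data"

docker run \
  -v "$PWD/data:/var/data" \
  -p 4040:4040 \
  dockerized-spark-submit-example:latest \
  /opt/spark/bin/spark-submit \
    --master "local[*]" \
    --class DockerizedSparkSubmitExample \
    /opt/app/app.jar  # placeholder path for the application JAR
```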
We want to provide a container image that has:
- Spark (libraries and executables)
- Java
- Our application
The Dockerfile is a two-stage build, where the first stage provides us with an area to build our application; this keeps the final image small. A sketch of a Dockerfile along these lines is shown after the list below.
- The first stage:
  - We use a base image that will support packaging our JAR with `sbt`.
  - We download `wget` so we can download a Spark distribution into the image.
  - We download the Spark distribution and unarchive it into `/opt/spark`.
  - We run `sbt package` to package our app as a thin JAR, which will write into `target/scala-2.12/<artifact-name>.jar` as per our `build.sbt` file. If we had more dependencies besides just Spark, we might package up as an uber JAR with `sbt assembly` for convenience.
- The second stage:
  - At this point we have Spark available and our application packaged, so we just set up the final image by copying it all over onto a basic Java image.
  - We install `tini` as a general best practice for signal handling.
  - We set up some environment variables for convenience when running our app.
  - We expose port 4040 for the Spark UI.
  - We copy over the Spark distribution and the app JAR.
  - We set our entrypoint to `tini` to provide a base for running your own variation of `spark-submit` if needed.
  - We set the default `CMD` to run `spark-submit`, referencing our app JAR and main class, as well as the `MASTER` provided as an environment variable that the user can configure if desired. The user can then either override `MASTER` or provide their own entire `spark-submit` command.
    - The reason we use shell form (`sh -c`) here is so we can get shell variable expansion for the `MASTER` environment variable. In exec form, Docker won't use a shell to execute `CMD`, so variable expansion won't work. See the Docker documentation on shell and exec form.
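Pulling those steps together, a Dockerfile in this shape might look roughly like the following. The base images, version numbers, download URLs, and paths are illustrative assumptions, not necessarily what this repository's Dockerfile actually uses.

```dockerfile
# Sketch of a two-stage build along the lines described above (versions,
# base-image tags, and paths are assumptions).

# --- Stage 1: build area with sbt, plus a downloaded Spark distribution ---
# Any base image providing a JDK, sbt, and Scala 2.12 will do; the tag here is illustrative.
FROM sbtscala/scala-sbt:eclipse-temurin-17.0.4_1.7.1_2.12.16 AS build

ARG SPARK_VERSION=3.5.1

# Fetch wget, then download and unarchive a Spark distribution into /opt/spark.
RUN apt-get update && apt-get install -y wget && \
    wget -q "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz" && \
    tar -xzf "spark-${SPARK_VERSION}-bin-hadoop3.tgz" && \
    mv "spark-${SPARK_VERSION}-bin-hadoop3" /opt/spark

# Package the application as a thin JAR (target/scala-2.12/<artifact-name>.jar).
WORKDIR /app
COPY build.sbt .
COPY src ./src
RUN sbt package

# --- Stage 2: runtime image with Java, Spark, tini, and the app JAR ---
FROM eclipse-temurin:17-jre

# tini for signal handling / zombie reaping (pinned release is an assumption).
ADD https://github.com/krallin/tini/releases/download/v0.19.0/tini /usr/bin/tini
RUN chmod +x /usr/bin/tini

# Convenience environment variables; MASTER can be overridden at run time.
ENV SPARK_HOME=/opt/spark \
    PATH="/opt/spark/bin:${PATH}" \
    MASTER="local[*]"

# Spark UI
EXPOSE 4040

COPY --from=build /opt/spark /opt/spark
COPY --from=build /app/target/scala-2.12/*.jar /opt/app/app.jar

# tini as the entrypoint so users can supply their own spark-submit command.
ENTRYPOINT ["/usr/bin/tini", "--"]

# Shell form so ${MASTER} is expanded at run time; the main class is a placeholder.
CMD spark-submit --master "${MASTER}" --class DockerizedSparkSubmitExample /opt/app/app.jar
```

With an image built from something like this, `docker run -e MASTER=... ...` overrides the master, while appending a full `spark-submit` command after the image name replaces the default `CMD` entirely.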