David J. Birnbaum and Ronald Haentjens Dekker
Last revised 2018-08-13
CollateX has a small number of dependencies that cause problems for some users. Distributing CollateX in a Docker container means that the dependencies can be packaged with it, and do not need to be installed separately by the user. Specifically:
- CollateX uses the python-Levenshtein package to support near matching. This package is a C library that is built on the local system, and not all users will have installed the compilers the build requires.
The following are historical dependencies that are not an issue with CollateX beginning with 2.1.3rc2, but that may cause problems for users of earlier versions:
- CollateX before 2.1.3rc2 required version 1.11 of the NetworkX library, and NetworkX beginning with 2.0 uses a different API, which was incompatible with those earlier versions of CollateX. By running CollateX inside a Docker container, users who needed later versions of NetworkX for other purposes could avoid having to downgrade their general Python library installations or create a separate Python environment just to run CollateX. CollateX beginning with 2.1.3rc2 is compatible with NetworkX 2.0 and later, and therefore requires no special accommodations.
- CollateX before 2.1.3rc2 used the PyGraphviz package to support SVG output of the variant graph. Like python-Levenshtein, PyGraphviz has to be compiled on the local machine, which was problematic for users who do not have compiler tools installed (typically on Windows). CollateX beginning with 2.1.3rc2 replaces PyGraphviz with the Python Graphviz package, which can be installed without compiler tools.
Check the “What to know before you install” section at Docker for Mac (CE) or Docker for Windows (CE). If your system meets these requirements, install the stable channel release of Docker CE. If not, install Docker Toolbox instead. Linux versions for different distributions are available from links at https://www.docker.com/community-edition#/download.
Following Docker installation, Windows users are likely to be prompted by Docker to enable Hyper-V in order to provide the virtualization on which Docker relies. The first time you try to start Docker after a reboot, you may see an error message prompting you to enable hardware-assisted virtualization in your BIOS. Before changing anything, read about virtualization and how to check your current settings. How you access your BIOS depends on which company designed your motherboard (its hardware) and the version of Windows you are running. (For example, on one tester’s computer running Windows 10 Education, the BIOS was designed by the ASUS company to work with an Intel Xeon 2.4 GHz processor. This information was needed to look up how to enable virtualization on this particular machine. In this case, two settings needed to be changed to “enabled”: Intel Virtualization Tech and IntelVT for Directed I/O.) The specific settings and their location in the BIOS will vary considerably, and you may need to try this a few times in order to get Docker to run.
Copy (or download) the following text into a file called Dockerfile in an otherwise empty directory:
FROM jupyter/datascience-notebook
USER root
COPY start-notebook.sh '/usr/local/bin'
RUN apt-get -y update && apt-get install -y \
graphviz \
libgraphviz-dev \
graphviz-dev \
pkg-config \
tofrodos \
&& rm -rf /var/lib/apt/lists/* \
&& fromdos '/usr/local/bin/start-notebook.sh' \
&& chmod +x '/usr/local/bin/start-notebook.sh'
USER jovyan
RUN pip install --upgrade --pre collatex \
&& pip install python-levenshtein \
&& pip install graphviz
CMD ["start-notebook.sh"]
Copy (or download) the following text into a file called start-notebook.sh in the same directory:
#!/bin/bash
exec jupyter notebook --NotebookApp.token='' &> /dev/null &
exec bash
Note: The user created by the Jupyter base image has userid jovyan.
Create an image by executing the following command in that directory (note that the line ends in a space followed by a dot):
docker build -t collatex .
This may take a long time, but you only have to do it once, and unless the build reports an error, you can ignore the messages that scroll down your screen. If the build process errors out with a “context canceled” message, start it again and it should pick up where it left off. After you’ve completed the build, you can run the image whenever needed without rebuilding it each time.
Inside the directory where you are configuring the container, run the following command:
mkdir work
Normally information on the local file system is not accessible inside the container, and files written inside the container disappear when the container exits. We create and configure a work directory to hold persistent data, that is, pre-existing local files that we want to be accessible inside the container, as well as files created inside the container that we want to remain accessible on the local file system after the container exits.
Run the image by executing the following command:
docker run -it -p 8888:8888 --rm -v '/Users/djb/collatex-docker/work:/home/jovyan/work' collatex
Notes:
- Windows users must follow the instructions at https://rominirani.com/docker-on-windows-mounting-host-directories-d96f3f056a2c in order to mount the local volume inside the container.
- All users must change the argument to the -v switch so that the part before the colon is a full path to a directory that exists on their local file systems. In the section above, we created a work directory in a specific location for that purpose, but you can mount any local directory instead. The part after the colon doesn’t change; whatever directory you specify will be accessible inside the container at the address /home/jovyan/work.
- If you are using port 8888 for another purpose on your host machine, change the number before the colon in the argument to the -p switch. For example, to access the notebook at http://localhost:8889, use -p 8889:8888. Do not change the number after the colon.
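The two adjustments described in the notes above (the host path for -v and the host port for -p) can be filled in programmatically. The following is a minimal sketch, not part of the original instructions: the function name docker_run_command and its parameters are hypothetical, and it assumes a macOS or Linux host, since Windows paths need the separate treatment linked above.

```python
import shlex
from pathlib import Path


def docker_run_command(host_dir, host_port=8888, image="collatex"):
    """Return the docker run command as a list of arguments (hypothetical helper)."""
    # Docker needs a full path before the colon in the -v argument.
    host_path = Path(host_dir).expanduser().resolve()
    return [
        "docker", "run", "-it",
        "-p", f"{host_port}:8888",               # the container port stays 8888
        "--rm",
        "-v", f"{host_path}:/home/jovyan/work",  # the container path stays fixed
        image,
    ]


if __name__ == "__main__":
    # "work" is the directory created earlier with mkdir work.
    print(shlex.join(docker_run_command("work", host_port=8889)))
```

Run from the directory that contains work, the script prints a docker run command of the same shape as the one shown earlier, with your own full path and port filled in.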
The command above does the following:
- Starts a container from your Docker image.
- Deposits you at the command line of the Linux container, where you will be logged in as userid jovyan at /home/jovyan. You can then start an interactive Python session and use CollateX as you normally would.
- Starts a Jupyter notebook server inside the container, which you can access from your local machine at http://localhost:8888.
- Mounts the local directory /Users/djb/collatex-docker/work inside the container as /home/jovyan/work. Anything already in that directory when you launch the container will be accessible inside the container, and anything you write into that directory while inside the container will remain accessible on the host machine after the container exits.
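Because only the mounted directory survives the container, results produced inside it should be written there. The following is a minimal sketch of that habit, assuming the mount point /home/jovyan/work from the command above; the helper name save_to_work is hypothetical and not part of CollateX.

```python
from pathlib import Path

# Mount point inside the container, as set by the -v argument.
WORK_DIR = Path("/home/jovyan/work")


def save_to_work(filename, text, work_dir=WORK_DIR):
    """Write text to a file in the mounted directory and return its path."""
    target = Path(work_dir) / filename
    target.write_text(text, encoding="utf-8")
    return target
```

Inside the container, a call such as save_to_work("notes.txt", "some output") should leave notes.txt in the local directory you mounted, where it remains after the container exits.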
Inside the container, start a Python session and run:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The quick brown fox jumps over the dog.")
collation.add_plain_witness("B", "The brown fox jumps over the lazy dog.")
alignment_table = collate(collation)
print(alignment_table)
You should see:
+---+-----+-------+--------------------------+------+------+
| A | The | quick | brown fox jumps over the | - | dog. |
| B | The | - | brown fox jumps over the | lazy | dog. |
+---+-----+-------+--------------------------+------+------+
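If the test above fails with an import error, it can help to check which of the packages installed by the Dockerfile are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the import names collatex, Levenshtein, and graphviz are assumed to correspond to the three pip packages.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed for the packages installed by the Dockerfile.
    required = ["collatex", "Levenshtein", "graphviz"]
    missing = missing_modules(required)
    print("missing:", ", ".join(missing) if missing else "none")
```

Any name the script lists as missing points to a pip install step that did not complete during the build.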
Inside a Jupyter notebook cell, run:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The quick brown fox jumps over the dog.")
collation.add_plain_witness("B", "The brown fox jumps over the lazy dog.")
alignment_table = collate(collation)
print(alignment_table)
collate(collation, output="svg")
You should see the same alignment table as above, followed by an SVG rendering of the variant graph.
Building Docker images may create intermediate images or containers (instances of images) that do not remove themselves cleanly when they are no longer needed. These are harmless, but messy, and they do take up disk space. The following commands will help manage them. Note that before you remove an image you need to remove any containers that refer to it.
| Command | What it does |
|---|---|
| docker ps -a | list all containers |
| docker rm <container-id> | remove the container |
| docker images -a | list all images |
| docker rmi <image-id> | remove the image |
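The ordering rule above (containers first, then the image they refer to) can be sketched as a small helper that prepares the commands without running them. The function name cleanup_commands and the identifiers in the example are hypothetical placeholders; substitute the values reported by docker ps -a and docker images -a.

```python
def cleanup_commands(container_ids, image_id):
    """Return docker commands that remove containers first, then the image."""
    # Containers that refer to the image must be removed before the image.
    commands = [["docker", "rm", cid] for cid in container_ids]
    commands.append(["docker", "rmi", image_id])
    return commands


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    for command in cleanup_commands(["<container-id>"], "<image-id>"):
        print(" ".join(command))
```

Each returned list could be passed to subprocess.run once you have confirmed that the identifiers are the ones you intend to remove.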
- Documentation for the Jupyter base image is available at https://hub.docker.com/r/jupyter/datascience-notebook/.
- The strategy for starting both a notebook server and an interactive command line simultaneously is partially based on https://stackoverflow.com/questions/34865097/run-jupyter-notebook-in-the-background-on-docker.
- The docker run command uses the following arguments (the explanation below is derived from https://djangostars.com/blog/what-is-docker-and-how-to-use-it-with-python/):
  - The -t flag assigns a pseudo-tty or terminal inside the new container.
  - The -i flag allows you to make an interactive connection by grabbing the standard input (STDIN) of the container.
  - The --rm flag automatically removes the container when the process exits. By default, containers are not deleted. This container exists as long as the shell session is active, and terminates when we exit from the session.
  - -v '/Users/djb/collatex-docker/work:/home/jovyan/work' makes the /Users/djb/collatex-docker/work directory accessible inside the container as /home/jovyan/work. Before running the command, you must change the part before the colon to the full path to an existing directory on your own local filesystem.
