Commit 267be76

[Refactor:Plagiarism] Run Lichen in a Docker container (#83)
* Dockerize lichen
* test GH actions with docker
* Update setup.sh
* Install Python dependencies for tokenizer tests
* Break unit tests and integration tests into two separate parts
* Fix paths for unit tests
* Fix path for integration tests
* Fix invalid GH Action
* change python version
* a few more fixes
* change permissions for pip packages
* Update python_unittests.yml
* debugging
* Update test.py
* debugging
* Update python_unittests.yml
* install clang
* Update python_unittests.yml
* Update python_unittests.yml
* Update python_unittests.yml
* need specific version of clang...
* Install Python dependencies outside the Docker container as well
* remove sudo
* Update setup.sh
* attempt to get permissions working.....
* Update install_lichen.sh
* Compile code outside Docker container
* pull or build flag
* use correct username for docker hub
* A few more fixes
* Add boost to container
* Requested changes
* Remove buggy environment variable
1 parent 4b5113c commit 267be76

18 files changed, +219 -254 lines changed

.github/workflows/lichen_ci.yml

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+name: Lichen CI
+
+on: [push, pull_request]
+
+env:
+  PYTHON_VERSION: 3.8
+
+jobs:
+  python-unit-tests:
+    runs-on: ubuntu-20.04
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions/setup-python@v2
+        with:
+          python-version: ${{ env.PYTHON_VERSION }}
+      - name: Install Python Dependencies
+        run: |
+          pip install -r requirements.txt
+      - name: Install Tokenizer Dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y clang-6.0
+      - name: Run Unit Tests
+        run: |
+          cd tests/unittest
+          python3 -m unittest discover
+
+  test-lichen-integration:
+    runs-on: ubuntu-20.04
+    steps:
+      - uses: actions/checkout@v2
+      - name: Install Lichen
+        run: |
+          sudo bash ./tests/integration/setup.sh
+      - name: Run Integration Tests
+        run: |
+          cd /usr/local/submitty/Lichen/tests/integration
+          sudo python3 -m unittest discover

.github/workflows/lichen_run.yml

Lines changed: 0 additions & 23 deletions
This file was deleted.

.github/workflows/pylint.yml

Lines changed: 2 additions & 2 deletions
@@ -4,12 +4,12 @@ on: [push]
 
 jobs:
   python-lint:
-    runs-on: ubuntu-18.04
+    runs-on: ubuntu-20.04
     steps:
       - uses: actions/checkout@v2
       - uses: actions/setup-python@v2
         with:
-          python-version: '3.6'
+          python-version: '3.9'
       - name: Cache Pip
         uses: actions/cache@v2
         with:

Dockerfile

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+FROM ubuntu:20.04
+
+ARG DEBIAN_FRONTEND=noninteractive
+
+# C++ and Python
+RUN apt-get update \
+    && apt-get install -y \
+       libboost-all-dev \
+       python3.8 \
+       python3-pip
+
+# Python Dependencies
+COPY requirements.txt /Lichen/requirements.txt
+RUN pip install -r /Lichen/requirements.txt
+
+# The script we run on startup
+CMD ["/Lichen/bin/process_all.sh"]

bin/concatenate_all.py

Lines changed: 2 additions & 1 deletion
@@ -148,7 +148,8 @@ def validate(config, args):
     other_gradeables = config["other_gradeables"]
 
     # Check we have a tokenizer to support the configured language
-    langs_data_json_path = "./data.json"  # data.json is in the Lichen/bin directory after install
+    langs_data_json_path = Path(Path(__file__).resolve().parent.parent,
+                                "tokenizer", "tokenizer_config.json")
     with open(langs_data_json_path, 'r') as langs_data_file:
         langs_data = json.load(langs_data_file)
     if language not in langs_data:
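
The hunk above swaps a path relative to the working directory (`./data.json`) for one derived from the script's own location, so the lookup no longer depends on where the caller happens to be. A minimal sketch of that idea follows; the helper name `tokenizer_config_path` and the example script location are illustrative assumptions, not part of the commit.

```python
from pathlib import Path


def tokenizer_config_path(script_path):
    """Build <root>/tokenizer/tokenizer_config.json for a script that lives
    one directory below <root> (for example <root>/bin/concatenate_all.py)."""
    # resolve() makes the path absolute, so the result does not change
    # with the current working directory of the calling process.
    root = Path(script_path).resolve().parent.parent
    return Path(root, "tokenizer", "tokenizer_config.json")


if __name__ == "__main__":
    # Hypothetical install location, used only for illustration.
    print(tokenizer_config_path("/opt/example/Lichen/bin/concatenate_all.py"))
```

Inside a real script, `__file__` would take the place of the hard-coded example path.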

bin/process_all.sh

Lines changed: 4 additions & 92 deletions
@@ -1,18 +1,5 @@
 #!/bin/sh
 
-# This script is the startup script for Lichen. It accepts a single path to a
-# directory containing a config file and creates the necessary output directories
-# as appropriate, relative to the provided path. It is possible to run this script
-# from the command line but it is meant to be run via the Plagiarism Detection UI.
-
-# TODO: Assert permissions, as necessary
-
-BASEPATH=$1 # holds the path to a directory containing a config for this gradeable
-            # (probably .../lichen/gradeable/<unique number>/ on Submitty)
-
-DATAPATH=$2 # holds the path to a directory conatining courses and their data
-            # (probably /var/local/submitty/courses on Submitty)
-
 KILL_ERROR_MESSAGE="
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * An error occured while running Lichen. Your run was probably killed for *
@@ -27,82 +14,7 @@ KILL_ERROR_MESSAGE="
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 ";
 
-# kill the script if there is no config file
-if [ ! -f "${BASEPATH}/config.json" ]; then
-    echo "Unable to find config.json in provided directory"
-    exit 1
-fi
-
-
-# delete any previous run results
-# TODO: determine if any caching should occur
-rm -rf "${BASEPATH}/logs"
-rm -rf "${BASEPATH}/other_gradeables"
-rm -rf "${BASEPATH}/users"
-rm -f "${BASEPATH}/overall_ranking.txt"
-rm -f "${BASEPATH}/provided_code/submission.concatenated"
-rm -f "${BASEPATH}/provided_code/tokens.json"
-rm -f "${BASEPATH}/provided_code/hashes.txt"
-
-# create these directories if they don't already exist
-mkdir -p "${BASEPATH}/logs"
-mkdir -p "${BASEPATH}/provided_code"
-mkdir -p "${BASEPATH}/provided_code/files"
-mkdir -p "${BASEPATH}/other_gradeables"
-mkdir -p "${BASEPATH}/users"
-
-# Run Lichen and exit if an error occurs
-{
-    ############################################################################
-    # Finish setting up Lichen run
-
-    # The default is r-x and we need PHP to be able to write if edits are made to the provided code
-    chmod g=rwxs "${BASEPATH}/provided_code/files" || exit 1
-
-    cd "$(dirname "${0}")" || exit 1
-
-    ############################################################################
-    # Do some preprocessing
-    echo "Beginning Lichen run: $(date +"%Y-%m-%d %H:%M:%S")"
-    ./concatenate_all.py "$BASEPATH" "$DATAPATH" || exit 1
-
-    ############################################################################
-    # Move the file somewhere to be processed (eventually this will be a worker machine)
-
-    # Tar+zip the file structure and save it to /tmp
-    cd $BASEPATH || exit 1
-    archive_name=$(sha1sum "${BASEPATH}/config.json" | awk '{ print $1 }') || exit 1
-    tar -czf "/tmp/LICHEN_JOB_${archive_name}.tar.gz" "config.json" "other_gradeables" "users" "provided_code" || exit 1
-    cd "$(dirname "${0}")" || exit 1
-
-    # TODO: move the archive to worker machine for processing
-
-    # Extract archive
-    tmp_location="/tmp/LICHEN_JOB_${archive_name}"
-    mkdir $tmp_location || exit 1
-    tar -xzf "/tmp/LICHEN_JOB_${archive_name}.tar.gz" -C "$tmp_location"
-    rm "/tmp/LICHEN_JOB_${archive_name}.tar.gz" || exit 1
-
-    ############################################################################
-    # Run Lichen
-    { # We still want to unzip files if an error occurs when running Lichen here
-        ./tokenize_all.py "$tmp_location" &&
-        ./hash_all.py "$tmp_location" &&
-        ./compare_hashes.out "$tmp_location" || echo "${KILL_ERROR_MESSAGE}" &&
-        ./similarity_ranking.py "$tmp_location";
-    }
-
-    ############################################################################
-    # Zip the results back up and send them back to the course's lichen directory
-    cd $tmp_location || exit 1
-    tar -czf "/tmp/LICHEN_JOB_${archive_name}.tar.gz" "."
-    rm -rf "$tmp_location" || exit 1
-
-    # TODO: Move the archive back from worker machine
-
-    # Extract archive and restore Lichen file structure
-    cd "$BASEPATH" || exit 1
-    tar --skip-old-files -xzf "/tmp/LICHEN_JOB_${archive_name}.tar.gz" -C "$BASEPATH"
-    rm "/tmp/LICHEN_JOB_${archive_name}.tar.gz" || exit 1
-
-} >> "${BASEPATH}/logs/lichen_job_output.txt" 2>&1
+python3 /Lichen/tokenizer/tokenize_all.py "/data" || exit 1
+python3 /Lichen/hasher/hash_all.py "/data" || exit 1
+/Lichen/compare_hashes/compare_hashes.out "/data" || { echo "${KILL_ERROR_MESSAGE}"; exit 1; }
+python3 /Lichen/similarity_ranking/similarity_ranking.py "/data";
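
After this change, `process_all.sh` is reduced to four stages run in order against `/data`, stopping as soon as one of the first three fails. The sketch below restates that control flow in Python for illustration only. The stage commands are copied from the diff; the function name `run_stages` and the injectable `run` parameter are assumptions made so the flow can be exercised without the real tools. Unlike the shell script, which exits with status 1 on failure, the sketch returns the failing stage's own exit code.

```python
import subprocess

# Stage commands as they appear in the new process_all.sh (container paths).
STAGES = [
    ["python3", "/Lichen/tokenizer/tokenize_all.py", "/data"],
    ["python3", "/Lichen/hasher/hash_all.py", "/data"],
    ["/Lichen/compare_hashes/compare_hashes.out", "/data"],
    ["python3", "/Lichen/similarity_ranking/similarity_ranking.py", "/data"],
]


def run_stages(stages, run=subprocess.call):
    """Run each stage in order and stop at the first non-zero exit code.

    Returns 0 if every stage succeeds, otherwise the failing stage's code.
    `run` defaults to subprocess.call and can be replaced with a stub.
    """
    for command in stages:
        exit_code = run(command)
        if exit_code != 0:
            return exit_code
    return 0
```

Passing a stub for `run` makes it possible to check the stop-on-first-failure behaviour without Docker or the Lichen binaries installed.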

bin/run_lichen.sh

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+# This script is the startup script for Lichen. It accepts a single path to a
+# directory containing a config file and creates the necessary output directories
+# as appropriate, relative to the provided path. It is possible to run this script
+# from the command line but it is meant to be run via the Plagiarism Detection UI.
+
+# TODO: Assert permissions, as necessary
+
+if [ "$#" -ne 2 ] || ! [ -d "$1" ] || ! [ -d "$2" ]; then
+    echo "Usage: $0 <basepath> <datapath>" >&2
+    exit 1
+fi
+
+BASEPATH="$1" # holds the path to a directory containing a config for this gradeable
+              # (probably .../lichen/gradeable/<unique number>/ on Submitty)
+
+DATAPATH="$2" # holds the path to a directory conatining courses and their data
+              # (probably /var/local/submitty/courses on Submitty)
+
+LICHEN_INSTALLATION_DIR=/usr/local/submitty/Lichen
+
+# kill the script if there is no config file
+if [ ! -f "${BASEPATH}/config.json" ]; then
+    echo "Unable to find config.json in provided directory"
+    exit 1
+fi
+
+
+# delete any previous run results
+# TODO: determine if any caching should occur
+rm -rf "${BASEPATH}/logs"
+rm -rf "${BASEPATH}/other_gradeables"
+rm -rf "${BASEPATH}/users"
+rm -f "${BASEPATH}/overall_ranking.txt"
+rm -f "${BASEPATH}/provided_code/submission.concatenated"
+rm -f "${BASEPATH}/provided_code/tokens.json"
+rm -f "${BASEPATH}/provided_code/hashes.txt"
+
+# create these directories if they don't already exist
+mkdir -p "${BASEPATH}/logs"
+mkdir -p "${BASEPATH}/provided_code"
+mkdir -p "${BASEPATH}/provided_code/files"
+mkdir -p "${BASEPATH}/other_gradeables"
+mkdir -p "${BASEPATH}/users"
+
+# Run Lichen and exit if an error occurs
+{
+    ############################################################################
+    # Finish setting up Lichen run
+
+    # The default is r-x and we need PHP to be able to write if edits are made to the provided code
+    chmod g=rwxs "${BASEPATH}/provided_code/files" || exit 1
+
+    cd "$(dirname "${0}")" || exit 1
+
+    ############################################################################
+    # Do some preprocessing
+    echo "Beginning Lichen run: $(date +"%Y-%m-%d %H:%M:%S")"
+    python3 concatenate_all.py "$BASEPATH" "$DATAPATH" || exit 1
+
+    ############################################################################
+    # Run Lichen
+
+    docker run -v "${BASEPATH}":/data -v "${LICHEN_INSTALLATION_DIR}":/Lichen submitty/lichen
+
+    ############################################################################
+    echo "Lichen run complete: $(date +"%Y-%m-%d %H:%M:%S")"
+} >> "${BASEPATH}/logs/lichen_job_output.txt" 2>&1
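
The `docker run` line in `run_lichen.sh` binds two host directories into the container: the gradeable's base path becomes `/data` and the Lichen installation becomes `/Lichen`, which is why `process_all.sh` can use fixed container paths. A small sketch of how the same argument list could be assembled programmatically follows; the function name `build_docker_command` is hypothetical, while the mount targets, default installation directory, and image name are taken from the script above.

```python
def build_docker_command(basepath,
                         install_dir="/usr/local/submitty/Lichen",
                         image="submitty/lichen"):
    """Return the argv list equivalent to the docker run line in run_lichen.sh."""
    return [
        "docker", "run",
        "-v", f"{basepath}:/data",       # per-gradeable config and results
        "-v", f"{install_dir}:/Lichen",  # installed Lichen scripts and binaries
        image,
    ]


if __name__ == "__main__":
    # Hypothetical gradeable directory, used only for illustration.
    print(" ".join(build_docker_command("/tmp/example_gradeable")))
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.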

compare_hashes/compare_hashes.cpp

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ int main(int argc, char* argv[]) {
 
   // ===========================================================================
   // load Lichen config data
-  std::ifstream lichen_config_istr("./lichen_config.json");
+  std::ifstream lichen_config_istr(boost::filesystem::path(boost::filesystem::system_complete(argv[0]).parent_path().parent_path() / "bin/lichen_config.json").string());
   assert(lichen_config_istr.good());
   nlohmann::json lichen_config = nlohmann::json::parse(lichen_config_istr);
   LichenConfig config;

bin/hash_all.py renamed to hasher/hash_all.py

Lines changed: 4 additions & 2 deletions
@@ -24,7 +24,8 @@ def hasher(lichen_config, lichen_run_config, my_tokenized_file, my_hashes_file):
     language = lichen_run_config["language"]
     hash_size = int(lichen_run_config["hash_size"])
 
-    data_json_path = "./data.json"  # data.json is in the Lichen/bin directory after install
+    data_json_path = Path(Path(__file__).resolve().parent.parent,
+                          "tokenizer", "tokenizer_config.json")
     with open(data_json_path) as token_data_file:
         token_data = json.load(token_data_file)
 
@@ -55,7 +56,8 @@ def main():
     with open(Path(args.basepath, "config.json")) as lichen_run_config_file:
         lichen_run_config = json.load(lichen_run_config_file)
 
-    with open(Path(__file__).resolve().parent / "lichen_config.json") as lichen_config_file:
+    with open(Path(Path(__file__).resolve().parent.parent,
+                   'bin', 'lichen_config.json')) as lichen_config_file:
         lichen_config = json.load(lichen_config_file)
 
     print("HASH ALL:", flush="True")
