Before starting anything, we have to install the dependencies needed to run the project, using one of these two methods :
Method 1.1 : in the GLVN_project path, create a fresh environment with the dependencies we want, using this command :
conda env create -f environment.yml
Method 1.2 : in the GLVN_project path, create a fresh environment with the dependencies we want, using the following commands :
python3 -m venv my_env
source my_env/bin/activate
pip install -r requirements.txt
IMPORTANT : whether you used method 1.1 or 1.2, you must make absolutely sure that the new environment you created is set as the default environment. This project relies on 'screens' and launches processes in separate sessions, so when those processes wake up, they must be in the same environment we are preparing right now; so just set it as the default.
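as a quick sanity check (this snippet is not part of the repo, just a hedged sketch), you can verify at the top of script.py that the interpreter actually comes from the environment you prepared; the EXPECTED_ENV value is an assumption to adapt :

import sys

# hypothetical guard : set EXPECTED_ENV to your conda env name, or to 'my_env'
# if you used method 1.2 (venv folder names appear in the interpreter path)
EXPECTED_ENV = "my_env"
if EXPECTED_ENV not in sys.executable:
    raise RuntimeError(f"wrong interpreter : {sys.executable}, expected env '{EXPECTED_ENV}'")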
The GLVN (Generalization to Longer Variable Names) project consists of 3 sequential steps, all linked by a single script.
NOTE : if you want to do a test run, go to title 3 directly; it shows an example of conducting data generation, followed by training, and finally evaluation.
this step hosts all data-generation-related scripts; we need it mainly for :
- generating the training dataset (in distribution)
- generating the test dataset (in distribution and out of distribution)
- saving all generated datasets in subfolders under data_gen
a single run of data_gen produces a folder containing the following :
- raw_id.txt : file containing all generated snippets
- stats.txt : file containing general stats about the raw_id.txt file (distribution of variable lengths, how many snippets were generated, etc.) (some numbers are not representative of the actual dataset because they have been zeroed out for debugging reasons)
- determinism_filtered_snippets.txt : file containing snippets that passed the determinism filtering test
- oversize_snippets.txt : file containing snippets that did not pass the determinism filtering test
- train.bin : a fragment of the dataset mainly used for training (binary file)
- test.bin / test.txt : a fragment of the dataset mainly used for testing (binary and text files)
- val.bin / val.txt : a fragment of the dataset mainly used for validation (binary and text files)
- vocab_size.txt : vocabulary size written as a single number in a text file
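as an illustration, here is a hedged sketch (not part of the repo) that checks whether a data_gen output folder contains every file listed above; 'dataset1' is a placeholder name :

from pathlib import Path

# files a completed data_gen run is expected to produce (list taken from above)
EXPECTED_FILES = [
    "raw_id.txt", "stats.txt",
    "determinism_filtered_snippets.txt", "oversize_snippets.txt",
    "train.bin", "test.bin", "test.txt",
    "val.bin", "val.txt", "vocab_size.txt",
]

folder = Path("./data_gen/dataset1")  # 'dataset1' is a placeholder
missing = [f for f in EXPECTED_FILES if not (folder / f).exists()]
print("all files present" if not missing else f"missing : {missing}")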
when script.py calls a single run of the data_gen part, this is what happens, in sequence :
1. tinypy_code_tracing_generator_parallel.py : generates one big txt file containing all snippets, following specific generation rules. Produced files : raw_id.txt, stats.txt
2. determinism_filtering_parallel.py : scans through raw_id.txt and picks only the snippets that pass the determinism filtering test. Produced files : determinism_filtered_snippets.txt, oversize_snippets.txt
3. data_preparation_CB_parallel.py : from determinism_filtered_snippets.txt, this script fragments the data into train, validation and test sets; note that all of them follow the same distribution, that is, the same generation rules. Splitting is done according to rules set in that same .py file. Produced files : train.bin, test.bin/test.txt, val.bin/val.txt, vocab_size.txt
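conceptually, the sequence boils down to the sketch below (a simplification : the real script.py may pass extra arguments, and the assumption that the three scripts are launched from ./data_gen is ours) :

import subprocess

# each stage consumes the files produced by the previous one,
# so they must run strictly one after the other
stages = [
    "tinypy_code_tracing_generator_parallel.py",  # -> raw_id.txt, stats.txt
    "determinism_filtering_parallel.py",          # -> determinism_filtered_snippets.txt, oversize_snippets.txt
    "data_preparation_CB_parallel.py",            # -> train.bin, test/val files, vocab_size.txt
]
for stage in stages:
    subprocess.run(["python", stage], cwd="./data_gen", check=True)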
Note : hyperparameters like dataset size or some generation rules are accessible in the script.py file
this step is responsible for training a language model from scratch on a generated dataset; it contains a single script : optimus_train_new.py
this script takes a dataset folder generated in the previous step (2.1) and trains a language model from scratch; so if the name of the dataset is dataset1,
the script would use ./data_gen/dataset1/train.bin as the training binary file and ./data_gen/dataset1/val.bin as the validation file.
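in other words, the two paths are derived from the dataset name alone, roughly like this (illustrative sketch, the real script may build them differently) :

import os

TRAIN_DATASET_NAME = "dataset1"  # placeholder name
train_bin = os.path.join(".", "data_gen", TRAIN_DATASET_NAME, "train.bin")
val_bin = os.path.join(".", "data_gen", TRAIN_DATASET_NAME, "val.bin")
print(train_bin, val_bin)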
produced files : a folder containing all the checkpoints saved during training + a best-model.pth file which points to the best checkpoint
Note : hyperparameters like seeds, number of checkpoints, GPUs used, etc. are accessible in the script.py file
this step takes the checkpoints produced in model_train and evaluates them on a set of specified test datasets; it can be used, for example :
- to test on in-distribution data : if the model was trained on ./data_gen/dataset1/train.bin, then we would evaluate on ./data_gen/dataset1/test.txt
- to test on out-of-distribution data : if the model was trained on ./data_gen/dataset1/train.bin, then we would evaluate on ./data_gen/NOTdataset1/test.txt
the output is a folder containing a list of subfolders, each corresponding to an evaluation of a specific checkpoint on a specific test dataset
the folder contains the following scripts :
- simple_arch.py : an atomic evaluation script that evaluates a single checkpoint on a single dataset
- scheduler.py : an orchestrator that launches many simple_arch.py processes in parallel, to cover multiple checkpoints or multiple test sets
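the fan-out performed by scheduler.py is essentially a cartesian product of checkpoints and test sets; a toy sketch (checkpoint names are placeholders, and the real scheduler launches simple_arch.py processes instead of printing) :

from itertools import product

checkpoints = ["ckpt_1.pth", "best-model.pth"]  # placeholder names
test_sets = ["./data_gen/dataset1/test.txt",
             "./data_gen/NOTdataset1/test.txt"]

# one simple_arch.py process per (checkpoint, test set) pair -> 4 evaluations here
for ckpt, test in product(checkpoints, test_sets):
    print(f"evaluate {ckpt} on {test}")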
Note : hyperparameters like which test datasets to use, which GPUs to use, etc. are accessible in the script.py file
IMPORTANT : although script.py contains a handful of hyperparameters governing the execution of the 3 big steps, not all of them are present in this file; specific hyperparameters, like the number of epochs, learning rate, etc., are present in the relevant scripts inside each of the 3 separate folders.
This script is the control center that operates all 3 major steps; all that is needed is to set the hyperparameters accordingly, then, from the GLVN_project folder, run : python script.py
Hyperparameters explained
# data generation :
MAX_DIGIT_COUNT : generation rule that controls the max length of a number (example : 3 means any number from -999 to 999)
MAX_CHAR_COUNT : generation rule that controls the max length of a variable name (so a value of 10 means that variable names cannot exceed 10 characters)
MIN_CHAR_COUNT : generation rule that controls the min length of a variable name (so a value of 5 means that variable names cannot be under 5 characters)
MAX_NESTING_DEPTH : generation rule that controls the max depth of a snippet (so a value of 2 means that if-blocks cannot exceed 2 indentation levels)
MIN_LINE_COUNT : generation rule that controls the min number of statements a snippet can have
MAX_LINE_COUNT : generation rule that controls the max number of statements a snippet can have
OUTPUT_DATASET_NAME : a string containing the name of a specific dataset version, for example 'DS1' would save the dataset in the folder : './data_gen/DS1/'
DATA_SIZE : number of snippets to be generated (example : 24_000_000)
NB_PROCESSES : number of processes to use in parallel; more means faster generation, but check CPU capabilities first (on the kindi machine I use 60 processes)
# training :
DEVICE_IDS : array of GPU ids; for example, if I want to use 2 GPUs (0 and 1), then I would write [0,1] (I use 4 GPUs)
NB_CHECKPOINTS : number of checkpoints saved during training; more is better to visualize the progress of accuracy as more data is used, as it gives better resolution in general (especially for saving the best-model.pth checkpoint, which is chosen more accurately with more checkpoints). A rule of thumb I use is 1 checkpoint per 1 million snippets (see the sketch after this list)
TRAIN_DATASET_NAME : string that tells the training script which dataset to use as the training dataset
SEED : integer seed for reproducibility
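applying the NB_CHECKPOINTS rule of thumb to the DATA_SIZE example given above :

DATA_SIZE = 24_000_000                   # example value from the data generation section
NB_CHECKPOINTS = DATA_SIZE // 1_000_000  # 1 checkpoint per 1 million snippets -> 24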
# evaluation :
EVAL_DATASETS_NAMES : dataset names which we will use to test our model checkpoints (for example, if we put ['train_data', 'ood_data'], we would test on these two files : ['./data_gen/train_data/test.txt', './data_gen/ood_data/test.txt']; see the sketch after this list)
EVAL_DATASETS_TITLES : evaluation titles paired with the previous dataset names; so if we use ['ID', 'OOD'], the results of evaluation on 'train_data' will be labeled 'ID', and the results of evaluation on 'ood_data' will be labeled 'OOD' in the output folder
# the previous two lists must be of the same length
EVAL_ALL_CHECKPOINTS : if set to True, the evaluation will be conducted on all checkpoints saved during training; if set to False, the script will only test the 'best-model.pth' checkpoint
DEVICE_EVAL_IDS : specifies the GPU ids that will be used for evaluation (only a single GPU is used per evaluation, so we need more of them if we want to evaluate many checkpoints in parallel)
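the mapping from EVAL_DATASETS_NAMES to actual test files follows directly from the data_gen folder layout; a minimal sketch :

EVAL_DATASETS_NAMES = ['train_data', 'ood_data']
eval_paths = [f"./data_gen/{name}/test.txt" for name in EVAL_DATASETS_NAMES]
print(eval_paths)  # ['./data_gen/train_data/test.txt', './data_gen/ood_data/test.txt']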
# General (MOST IMPORTANT PART)
# the following 3 booleans serve as gates that either launch or skip a specific step of the project; useful if we want to do one specific thing only, without launching the full data-train-eval process every time (a sketch of this gating follows the three flags)
generate_data = True
train_model = True
evaluate_model = True
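inside script.py, the gating plausibly reduces to three independent if-blocks, along these lines (the three run_* functions are hypothetical stand-ins, not the actual code) :

# hypothetical sketch of the gate logic (the real step functions differ)
def run_data_generation(): print("step 1 : data_gen")
def run_training():        print("step 2 : model_train")
def run_evaluation():      print("step 3 : model_eval")

generate_data = True
train_model = True
evaluate_model = True

if generate_data:
    run_data_generation()
if train_model:
    run_training()
if evaluate_model:
    run_evaluation()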
Important : for the third step of model evaluation, although the terminal might show that it executed successfully, that does not necessarily mean it is done; in fact, screens may still be running in the background, and you might want to use the screen -ls command to check whether evaluations are still running or not
Scenario : we want to do the following :
- Train a model on 1 million snippets containing variable names between 1 and 10 characters long
- Save 10 checkpoints (so one checkpoint every 100k examples)
- evaluate only the best checkpoint, on in-distribution data (following the same rules as the training data) and on another dataset that contains variable names between 11 and 15 chars long
This is how we will proceed :
STEP 1 : generate the two datasets
first we want to close the training and eval gates, so we set these hyperparameters :
generate_data = True
train_model = False
evaluate_model = False
we first generate the in-distribution training dataset (named 'train_data') by setting the following hyperparameters :
MAX_DIGIT_COUNT = 3
MAX_CHAR_COUNT = 10
MIN_CHAR_COUNT = 1
MAX_NESTING_DEPTH = 2
MIN_LINE_COUNT = 5
MAX_LINE_COUNT = 10
OUTPUT_DATASET_NAME = "train_data"
DATA_SIZE = 1_000_000
NB_PROCESSES = 10
we then run python script.py
after this run, we should have a folder under data_gen named train_data that contains both binary training data and the in-distribution test file
next, we generate the out-of-distribution dataset (named 'ood_data') by setting the following hyperparameters :
MAX_DIGIT_COUNT = 3
MAX_CHAR_COUNT = 15
MIN_CHAR_COUNT = 11
MAX_NESTING_DEPTH = 2
MIN_LINE_COUNT = 5
MAX_LINE_COUNT = 10
OUTPUT_DATASET_NAME = "ood_data"
DATA_SIZE = 100_000
NB_PROCESSES = 10
we then run python script.py
after this run, we should have another folder under data_gen named ood_data that contains both binary training data and the out-of-distribution test file (we only care about the test file since we are going to train on the in-distribution dataset)
STEP 2 : training and testing
now that we have the datasets ready, we can do both training and evaluation at the same time !
so now we want to open the training and eval gates and close the data generation gate, by setting the following hyperparameters :
generate_data = False
train_model = True
evaluate_model = True
we then set the training and evaluation hyperparameters :
# training :
DEVICE_IDS = [0,1,2,3]
NB_CHECKPOINTS = 10
TRAIN_DATASET_NAME = 'train_data'
SEED = 1
# evaluation :
EVAL_DATASETS_NAMES = ['train_data', 'ood_data']
EVAL_DATASETS_TITLES = ['in_distribution', 'out_of_distribution']
# the previous two lists must be of the same length
EVAL_ALL_CHECKPOINTS = False
DEVICE_EVAL_IDS = [0,1,2,3]
notice how NB_CHECKPOINTS is set to 10, and that, since we want to evaluate the best model on two test sets, we listed both of them in EVAL_DATASETS_NAMES : ['train_data', 'ood_data'] and set EVAL_ALL_CHECKPOINTS to False
we then run python script.py
and after training and eval are complete, we have access to all the results in ./model_eval/eval_results/
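once the evaluation screens have finished (check with screen -ls, as noted above), a quick way to browse the result subfolders, one per (checkpoint, test set) evaluation (hedged sketch; the exact nesting may differ) :

from pathlib import Path

results = Path("./model_eval/eval_results")
for sub in sorted(p for p in results.rglob("*") if p.is_dir()):
    print(sub)  # one subfolder per evaluation, labeled with its EVAL_DATASETS_TITLES entry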
FINAL NOTES :
- in reality, all the separate scripts inside each STEP can be run independently, but to simplify things, we created script.py to launch all scripts from one place, without the hassle of manual process launching. This means that you can modify the scripts however you like and update visible/hidden hyperparameters in this project depending on your needs
- the main terminal may not show enough info about the progress made during each step (data generation, training, etc.), so you can do a manual check by first running screen -ls to list the steps currently running, and then screen -r SCREEN_NAME to see the live progress of the corresponding step; you can always detach from a screen by holding ctrl and pressing a then d (your finger still on ctrl)
- do not make edits to script.py while the processes are running, because the internal scripts use the hyperparameters mentioned in the main script
- when generating data, do not use the names of existing datasets, or the reserved name 'data'