Before starting anything, we have to install the dependencies needed to run the project, using one of these two methods :
Method 1.1 : in the GLVN_project path, create a fresh environment with the dependencies we want, using this command :
conda env create -f environment.yml
Method 1.2 : in the GLVN_project path, create a fresh environment with the dependencies we want, using the following commands :
python3 -m venv my_env
source my_env/bin/activate
pip install -r requirements.txt
IMPORTANT : whether you used method 1.1 or 1.2, you must make absolutely sure that the new environment you created is set as the default environment. This project relies on 'screens' and launches processes in separate sessions, so when those processes wake up, they must be in the same environment we are preparing right now; so just set it as the default.
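as a quick sanity check (this snippet is not part of the repo, just a hedged sketch), you can verify at the top of script.py that the interpreter actually comes from the environment you prepared; the EXPECTED_ENV value is an assumption to adapt :

import sys

# hypothetical guard : set EXPECTED_ENV to your conda env name, or to 'my_env'
# if you used method 1.2 (venv folder names appear in the interpreter path)
EXPECTED_ENV = "my_env"
if EXPECTED_ENV not in sys.executable:
    raise RuntimeError(f"wrong interpreter : {sys.executable}, expected env '{EXPECTED_ENV}'")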
The GLVN (Generalization to Longer Variable Names) project consists of 3 sequential steps, all linked by a single script.
NOTE : if you want to do a test run, go to title 3 directly; it shows an example of conducting data generation, followed by training, and finally evaluation.
this step hosts all data-generation-related scripts; we need it mainly for :
- generating the training dataset (in distribution)
- generating the test dataset (in distribution and out of distribution)
- saving all generated datasets in subfolders under data_gen
a single run of data_gen produces a folder containing the following :
- raw_id.txt : file containing all generated snippets
- stats.txt : file containing general stats about the raw_id.txt file (distribution of variable lengths, how many snippets were generated, etc.) (some numbers are not representative of the actual dataset because they have been zeroed out for debugging reasons)
- determinism_filtered_snippets.txt : file containing snippets that passed the determinism filtering test
- oversize_snippets.txt : file containing snippets that did not pass the determinism filtering test
- train.bin : a fragment of the dataset mainly used for training (binary file)
- test.bin / test.txt : a fragment of the dataset mainly used for testing (binary and text files)
- val.bin / val.txt : a fragment of the dataset mainly used for validation (binary and text files)
- vocab_size.txt : vocabulary size written as a single number in a text file
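as an illustration, here is a hedged sketch (not part of the repo) that checks whether a data_gen output folder contains every file listed above; 'dataset1' is a placeholder name :

from pathlib import Path

# files a completed data_gen run is expected to produce (list taken from above)
EXPECTED_FILES = [
    "raw_id.txt", "stats.txt",
    "determinism_filtered_snippets.txt", "oversize_snippets.txt",
    "train.bin", "test.bin", "test.txt",
    "val.bin", "val.txt", "vocab_size.txt",
]

folder = Path("./data_gen/dataset1")  # 'dataset1' is a placeholder
missing = [f for f in EXPECTED_FILES if not (folder / f).exists()]
print("all files present" if not missing else f"missing : {missing}")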
when script.py calls a single run of the data_gen part, this is what happens, in sequence :
1. tinypy_code_tracing_generator_parallel.py : generates one big txt file containing all snippets, following specific generation rules. Produced files : raw_id.txt, stats.txt
2. determinism_filtering_parallel.py : scans through raw_id.txt and picks only the snippets that pass the determinism filtering test. Produced files : determinism_filtered_snippets.txt, oversize_snippets.txt
3. data_preparation_CB_parallel.py : from determinism_filtered_snippets.txt, this script fragments the data into train, validation and test sets; note that all of them follow the same distribution, that is, the same generation rules. Splitting is done according to rules set in that same .py file. Produced files : train.bin, test.bin/test.txt, val.bin/val.txt, vocab_size.txt
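conceptually, the sequence boils down to the sketch below (a simplification : the real script.py may pass extra arguments, and the assumption that the three scripts are launched from ./data_gen is ours) :

import subprocess

# each stage consumes the files produced by the previous one,
# so they must run strictly one after the other
stages = [
    "tinypy_code_tracing_generator_parallel.py",  # -> raw_id.txt, stats.txt
    "determinism_filtering_parallel.py",          # -> determinism_filtered_snippets.txt, oversize_snippets.txt
    "data_preparation_CB_parallel.py",            # -> train.bin, test/val files, vocab_size.txt
]
for stage in stages:
    subprocess.run(["python", stage], cwd="./data_gen", check=True)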
Note : hyperparameters like dataset size or some generation rules are accessible in the script.py file
this step is responsible for training a language model from scratch on a generated dataset; it contains a single script : optimus_train_new.py
this script takes a dataset folder generated in the previous step (2.1) and trains a language model from scratch; so if the name of the dataset is dataset1,
the script would use ./data_gen/dataset1/train.bin as the training binary file and ./data_gen/dataset1/val.bin as the validation file.
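in other words, the two paths are derived from the dataset name alone, roughly like this (illustrative sketch, the real script may build them differently) :

import os

TRAIN_DATASET_NAME = "dataset1"  # placeholder name
train_bin = os.path.join(".", "data_gen", TRAIN_DATASET_NAME, "train.bin")
val_bin = os.path.join(".", "data_gen", TRAIN_DATASET_NAME, "val.bin")
print(train_bin, val_bin)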
produced files : a folder containing all the checkpoints saved during training + a best-model.pth file which points to the best checkpoint
Note : hyperparameters like seeds, number of checkpoints, GPUs used, etc. are accessible in the script.py file
this step takes the checkpoints produced in model_train and evaluates them on a set of specified test datasets; it can be used, for example :
- to test on in-distribution data : if the model was trained on ./data_gen/dataset1/train.bin, then we would evaluate on ./data_gen/dataset1/test.txt
- to test on out-of-distribution data : if the model was trained on ./data_gen/dataset1/train.bin, then we would evaluate on ./data_gen/NOTdataset1/test.txt
the output is a folder containing a list of subfolders, each corresponding to an evaluation of a specific checkpoint on a specific test dataset
the folder contains the following scripts :
- simple_arch.py : an atomic evaluation script that evaluates a single checkpoint on a single dataset
- scheduler.py : an orchestrator that launches many simple_arch.py processes in parallel, to cover multiple checkpoints or multiple test sets
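the fan-out performed by scheduler.py is essentially a cartesian product of checkpoints and test sets; a toy sketch (checkpoint names are placeholders, and the real scheduler launches simple_arch.py processes instead of printing) :

from itertools import product

checkpoints = ["ckpt_1.pth", "best-model.pth"]  # placeholder names
test_sets = ["./data_gen/dataset1/test.txt",
             "./data_gen/NOTdataset1/test.txt"]

# one simple_arch.py process per (checkpoint, test set) pair -> 4 evaluations here
for ckpt, test in product(checkpoints, test_sets):
    print(f"evaluate {ckpt} on {test}")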
Note : hyperparameters like which test datasets to use, which GPUs to use, etc. are accessible in the script.py file
IMPORTANT : although script.py contains a handful of hyperparameters governing the execution of the 3 big steps, not all of them are present in this file; specific hyperparameters, like the number of epochs, learning rate, etc., are present in the relevant scripts inside each of the 3 separate folders.
This script is the control center that operates all 3 major steps; all that is needed is to set the hyperparameters accordingly, then, from the GLVN_project folder, run : python script.py
Hyperparameters explained
# data generation :
MAX_DIGIT_COUNT : generation rule that controls the max length of a number (example : 3 means any number from -999 to 999)
MAX_CHAR_COUNT : generation rule that controls the max length of a variable name (so a value of 10 means that variable names cannot exceed 10 characters)
MIN_CHAR_COUNT : generation rule that controls the min length of a variable name (so a value of 5 means that variable names cannot be under 5 characters)
MAX_NESTING_DEPTH : generation rule that controls the max depth of a snippet (so a value of 2 means that if-blocks cannot exceed 2 indentation levels)
MIN_LINE_COUNT : generation rule that controls the min number of statements a snippet can have
MAX_LINE_COUNT : generation rule that controls the max number of statements a snippet can have
OUTPUT_DATASET_NAME : a string containing the name of a specific dataset version, for example 'DS1' would save the dataset in the folder : './data_gen/DS1/'
DATA_SIZE : number of snippets to be generated (example : 24_000_000)
NB_PROCESSES : number of processes to use in parallel; more means faster generation, but check CPU capabilities first (on the kindi machine I use 60 processes)
# training :
DEVICE_IDS : array of GPU ids; for example, if I want to use 2 GPUs (0 and 1), then I would write [0,1] (I use 4 GPUs)
NB_CHECKPOINTS : number of checkpoints saved during training; more is better to visualize the progress of accuracy as more data is used, as it gives better resolution in general (especially for saving the best-model.pth checkpoint, which is chosen more accurately with more checkpoints). A rule of thumb I use is 1 checkpoint per 1 million snippets (see the sketch after this list)
TRAIN_DATASET_NAME : string that tells the training script which dataset to use as the training dataset
SEED : integer seed for reproducibility
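applying the NB_CHECKPOINTS rule of thumb to the DATA_SIZE example given above :

DATA_SIZE = 24_000_000                   # example value from the data generation section
NB_CHECKPOINTS = DATA_SIZE // 1_000_000  # 1 checkpoint per 1 million snippets -> 24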
# evaluation :
EVAL_DATASETS_NAMES : dataset names which we will use to test our model checkpoints (for example, if we put ['train_data', 'ood_data'], we would test on these two files : ['./data_gen/train_data/test.txt', './data_gen/ood_data/test.txt']; see the sketch after this list)
EVAL_DATASETS_TITLES : evaluation titles paired with the previous dataset names; so if we use ['ID', 'OOD'], the results of evaluation on 'train_data' will be labeled 'ID', and the results of evaluation on 'ood_data' will be labeled 'OOD' in the output folder
# the previous two lists must be of the same length
EVAL_ALL_CHECKPOINTS : if set to True, the evaluation will be conducted on all checkpoints saved during training; if set to False, the script will only test the 'best-model.pth' checkpoint
DEVICE_EVAL_IDS : specifies the GPU ids that will be used for evaluation (only a single GPU is used per evaluation, so we need more of them if we want to evaluate many checkpoints in parallel)
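the mapping from EVAL_DATASETS_NAMES to actual test files follows directly from the data_gen folder layout; a minimal sketch :

EVAL_DATASETS_NAMES = ['train_data', 'ood_data']
eval_paths = [f"./data_gen/{name}/test.txt" for name in EVAL_DATASETS_NAMES]
print(eval_paths)  # ['./data_gen/train_data/test.txt', './data_gen/ood_data/test.txt']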
# General (MOST IMPORTANT PART)
# the following 3 booleans serve as gates that either launch or skip a specific step of the project; useful if we want to do one specific thing only, without launching the full data-train-eval process every time (a sketch of this gating follows the three flags)
generate_data = True
train_model = True
evaluate_model = True
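inside script.py, the gating plausibly reduces to three independent if-blocks, along these lines (the three run_* functions are hypothetical stand-ins, not the actual code) :

# hypothetical sketch of the gate logic (the real step functions differ)
def run_data_generation(): print("step 1 : data_gen")
def run_training():        print("step 2 : model_train")
def run_evaluation():      print("step 3 : model_eval")

generate_data = True
train_model = True
evaluate_model = True

if generate_data:
    run_data_generation()
if train_model:
    run_training()
if evaluate_model:
    run_evaluation()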
Important : for the third step of model evaluation, although the terminal might show that it executed successfully, that does not necessarily mean it is done; in fact, screens may still be running in the background, and you might want to use the screen -ls command to check whether evaluations are still running or not
Scenario : we want to do the following :
- Train a model on 1 million snippets containing variable names between 1 and 10 characters long
- Save 10 checkpoints (so one checkpoint every 100k examples)
- evaluate only the best checkpoint, on in-distribution data (following the same rules as the training data) and on another dataset that contains variable names between 11 and 15 chars long
This is how we will proceed :
STEP 1 : generate the two datasets
first we want to close the training and eval gates, so we set these hyperparameters :
generate_data = True
train_model = False
evaluate_model = False
we first generate the in-distribution training dataset (named 'train_data') by setting the following hyperparameters :
MAX_DIGIT_COUNT = 3
MAX_CHAR_COUNT = 10
MIN_CHAR_COUNT = 1
MAX_NESTING_DEPTH = 2
MIN_LINE_COUNT = 5
MAX_LINE_COUNT = 10
OUTPUT_DATASET_NAME = "train_data"
DATA_SIZE = 1_000_000
NB_PROCESSES = 10
we then run python script.py
after this run, we should have a folder under data_gen named train_data that contains both binary training data and the in-distribution test file
next, we generate the out-of-distribution dataset (named 'ood_data') by setting the following hyperparameters :
MAX_DIGIT_COUNT = 3
MAX_CHAR_COUNT = 15
MIN_CHAR_COUNT = 11
MAX_NESTING_DEPTH = 2
MIN_LINE_COUNT = 5
MAX_LINE_COUNT = 10
OUTPUT_DATASET_NAME = "ood_data"
DATA_SIZE = 100_000
NB_PROCESSES = 10
we then run python script.py
after this run, we should have another folder under data_gen named ood_data that contains both binary training data and the out-of-distribution test file (we only care about the test file since we are going to train on the in-distribution dataset)
STEP 2 : training and testing
now that we have the datasets ready, we can do both training and evaluation at the same time !
so now we want to open the training and eval gates and close the data generation gate, by setting the following hyperparameters :
generate_data = False
train_model = True
evaluate_model = True
we then set the training and evaluation hyperparameters :
# training :
DEVICE_IDS = [0,1,2,3]
NB_CHECKPOINTS = 10
TRAIN_DATASET_NAME = 'train_data'
SEED = 1
# evaluation :
EVAL_DATASETS_NAMES = ['train_data', 'ood_data']
EVAL_DATASETS_TITLES = ['in_distribution', 'out_of_distribution']
# the previous two lists must be of the same length
EVAL_ALL_CHECKPOINTS = False
DEVICE_EVAL_IDS = [0,1,2,3]
notice how NB_CHECKPOINTS is set to 10, and that, since we want to evaluate the best model on two test sets, we listed both of them in EVAL_DATASETS_NAMES : ['train_data', 'ood_data'] and set EVAL_ALL_CHECKPOINTS to False
we then run python script.py
and after training and eval are complete, we have access to all the results in ./model_eval/eval_results/
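once the evaluation screens have finished (check with screen -ls, as noted above), a quick way to browse the result subfolders, one per (checkpoint, test set) evaluation (hedged sketch; the exact nesting may differ) :

from pathlib import Path

results = Path("./model_eval/eval_results")
for sub in sorted(p for p in results.rglob("*") if p.is_dir()):
    print(sub)  # one subfolder per evaluation, labeled with its EVAL_DATASETS_TITLES entry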
FINAL NOTES :
- in reality, all the separate scripts inside each STEP can be run independently, but to simplify things, we created script.py to launch all scripts from one place, without the hassle of manual process launching. This means that you can modify the scripts however you like and update visible/hidden hyperparameters in this project depending on your needs
- the main terminal may not show enough info about the progress made during each step (data generation, training, etc.), so you can do a manual check by first running screen -ls to list the steps currently running, and then screen -r SCREEN_NAME to see the live progress of the corresponding step; you can always detach from a screen by holding ctrl and pressing a then d (your finger still on ctrl)
- do not make edits to script.py while the processes are running, because the internal scripts use the hyperparameters mentioned in the main script
- when generating data, do not use the names of existing datasets, or the reserved name 'data'