Source Code Authorship Attribution

File organization

inputs/processed_dfs - datasets used for the actual training of the models; they are more lightweight than the original dataset, which is hidden by .gitignore
src - Python and .ipynb files

2.1 data_processing/ - processing of the raw GCJ data (time-consuming)

2.2 models/ - main directory with the models, organized in the inheritance hierarchy described below

2.3 training/ - utilities used by the models: GridSearch methods and the training callback

2.4 main.py - starting point
outputs - images, models, etc.

How to run

Install the requirements from requirements.txt: pip install -r requirements.txt
Edit src/main.py as appropriate (uncomment the following lines as needed):

2.1 To train the embedding-based model:

embedding = Embedding(make_initial_preprocess=False)
embedding.train(batch_size=128, epochs=1)

These commands create the model and train it for one epoch with a batch size of 128. If make_initial_preprocess is set to False, the dataset is taken from inputs/preprocessed_jsons/embedding_train.json. Otherwise, access to the raw data is required.
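The input selection described above can be sketched as a small pre-flight check to run before training. This is only an illustration: the helper names required_input and check_input are hypothetical, and raw_path is a placeholder because the location of the raw data is not stated here.

```python
from pathlib import Path

# Path quoted in the text above, relative to the repository root.
PREPROCESSED = Path("inputs/preprocessed_jsons/embedding_train.json")


def required_input(make_initial_preprocess: bool, raw_path: Path) -> Path:
    """Return the file that has to exist before training starts.

    raw_path is a placeholder supplied by the caller; this README does not
    say where the raw dataset lives.
    """
    return raw_path if make_initial_preprocess else PREPROCESSED


def check_input(make_initial_preprocess: bool, raw_path: Path,
                root: Path = Path(".")) -> Path:
    """Resolve the required file under root and fail early if it is missing."""
    path = root / required_input(make_initial_preprocess, raw_path)
    if not path.is_file():
        raise FileNotFoundError(f"missing training input: {path}")
    return path
```

Calling check_input(False, Path("raw.csv")) from the repository root fails fast with a clear message if the preprocessed JSON is absent, instead of failing somewhere inside the training loop.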

2.2 To train the conv2d-based model:

conv2d = Conv2D(make_initial_preprocess=True)
conv2d.train(batch_size=128, epochs=1)

Warning: to train this model, the raw dataset (py_df.csv) is required

2.3 To generate the images that show what the models focus on:

Visualizer("conv2d").run()
Visualizer("embedding").run()

Example of embedding-based visualization

Example of conv2d visualization

2.4 To show all the layers of the models:

KeractVisualizer("conv2d").run()
KeractVisualizer("embedding").run()

Run python3 src/main.py and fix import errors, if any

Class diagram

Model - root class (interface for all models)
Triplet(Model) - triplet-loss specific methods (batch generation, fit process, full model creation, etc.)
Embedding(Triplet) - concrete implementation of the target architecture
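The hierarchy above can be sketched roughly as follows. This is not the code from src/models: only the class names, the make_initial_preprocess argument and the train(batch_size, epochs) call come from this README. The triplet_loss method, the margin value and the pure-Python distance are assumptions made for illustration, using the standard triplet-loss formulation max(d(a, p) - d(a, n) + margin, 0).

```python
from abc import ABC, abstractmethod
from typing import Sequence


class Model(ABC):
    """Root class: the interface shared by all models."""

    @abstractmethod
    def train(self, batch_size: int, epochs: int) -> None:
        ...


class Triplet(Model):
    """Triplet-loss specifics that concrete architectures inherit."""

    margin = 0.2  # assumed value, not taken from the repository

    @staticmethod
    def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two embedding vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def triplet_loss(self, anchor, positive, negative) -> float:
        # Zero once the negative is farther from the anchor than the
        # positive by at least the margin.
        gap = self._sq_dist(anchor, positive) - self._sq_dist(anchor, negative)
        return max(gap + self.margin, 0.0)


class Embedding(Triplet):
    """Concrete architecture; the real network is omitted in this sketch."""

    def __init__(self, make_initial_preprocess: bool = False):
        self.make_initial_preprocess = make_initial_preprocess

    def train(self, batch_size: int, epochs: int) -> None:
        # Placeholder body: the actual model building and fitting live in
        # the repository, not here.
        print(f"training: batch_size={batch_size}, epochs={epochs}")
```

Keeping the loss and batch logic in Triplet means a new architecture only has to supply its own network and train method.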

More detailed visualization

Visualization:

Visualizer - visualization based on tf-keras-vis, which performs per-pixel modifications of the image; this can potentially lead to errors
KeractVisualizer - visualization based on the keract library.

WARNING: if an error occurs, replace the ._layers call with a .layers call in the keract.py file inside the library (tensorflow version 2.5.0, keract version 4.4.0)
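The same substitution can be pictured as a small compatibility helper. This is a hedged sketch only: model_layers is a hypothetical name, it is not part of keract or TensorFlow, and the workaround described above is still a manual edit of keract.py.

```python
def model_layers(model):
    """Return the layers of a model as a list.

    Prefers the public `.layers` attribute and falls back to the private
    `._layers` attribute for objects that only expose the latter.
    """
    layers = getattr(model, "layers", None)
    if layers is None:
        layers = getattr(model, "_layers")
    return list(layers)
```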

Useful commands

>> tensorboard --logdir=outputs/tensor_board
>> source ./venv/bin/activate
>> source /opt/anaconda/bin/activate root
>> docker run --gpus all --device /dev/nvidia0 --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidiactl -v /home/alina/SourceCodeAuthorshipAttribution/:/usr/app/ 748cf8b681db python /usr/app/src/main.py
>> docker build -t scaa .
