This project integrates tools for analyzing and compiling source code and for fine-tuning language models to analyze and correct it. It combines C++ and Python components with notebooks for fine-tuning LLMs.
CPP_Compiler/ contains a compiler and parser for source code written in C++:
- lexer/: Implements lexical analysis (tokenization) of source code.
- parser/: Implements syntactic analysis.
- grammar/tok.txt: List of tokens recognized by the compiler.
- grammar/output.txt: Sequence of token indices generated by the lexer.
- grammar/token_mapper.py: A Python script that maps the indices in output.txt to the token names from tok.txt, printing the sequence of tokens and respecting line breaks (NEWLINE).
Run the script from the root of the project, or from within the CPP_Compiler/grammar/ directory with the script path adjusted accordingly:
python3 CPP_Compiler/grammar/token_mapper.py
The script prints the sequence of tokens corresponding to the indices in output.txt, starting a new line each time it encounters the NEWLINE token.
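The mapping step described above can be illustrated with a minimal sketch. This is not the actual token_mapper.py: it assumes tok.txt holds one token name per line (with the index being the zero-based line number) and that output.txt holds whitespace-separated integer indices. The function names and the sample data are hypothetical.

```python
# Illustrative sketch only. The file formats are assumptions and are not
# taken from the real token_mapper.py.

def load_token_names(tok_text):
    """Assumed format: one token name per line; index = zero-based line number."""
    return [line.strip() for line in tok_text.splitlines() if line.strip()]


def map_indices(indices, names):
    """Translate indices to token names, closing a line at each NEWLINE token."""
    lines, current = [], []
    for index in indices:
        name = names[index]
        if name == "NEWLINE":
            lines.append(current)
            current = []
        else:
            current.append(name)
    if current:  # keep trailing tokens that are not followed by NEWLINE
        lines.append(current)
    return lines


# Hypothetical sample data standing in for tok.txt and output.txt.
tok_text = "INT\nIDENT\nSEMICOLON\nNEWLINE\n"
output_text = "0 1 2 3 0 1 2"

names = load_token_names(tok_text)
indices = [int(field) for field in output_text.split()]
for line in map_indices(indices, names):
    print(" ".join(line))
```

If the real files use a different layout (for example, explicit index and name pairs in tok.txt), only load_token_names would need to change; the line-grouping logic stays the same.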
fine-tuning-LLM/ contains notebooks and scripts for fine-tuning large language models (LLMs):
- Notebooks: Fine-tuning examples and experiments using various frameworks and datasets.
- dataset/: Datasets and scripts for preparing training data.
Install the necessary dependencies for fine-tuning with:
pip install -r fine-tuning-LLM/requirements.txt