This repository contains our solution to the "AI Agent 007: Tooling up for success" problem statement, provided by DevRev as part of the Inter-IIT Tech Meet 12.0.
We present RTaC, which reconceptualizes tooling as a coding task to exploit the powerful code-comprehension capabilities of LLMs. RTaC provides tools in docstring format to instruction-finetuned base coding LLMs, extracts output as Python code, and then deterministically converts it to JSON. RTaC promotes docstring-reading capability in the LLMs and hence supports tool modification, addition, and deletion. Using RTaC, we match GPT-4 benchmark performance while employing just the DeepSeek 1.3B and CodeLlama 7B LLMs, a reduction in parameter count of over 300 times. We also cut cost per query by over five times while matching GPT-4's latency. Moreover, RTaC supports the processing of complex conditional and iterative logic (Bonus), surpassing GPT-4's capabilities.
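To make this pipeline concrete, here is a minimal, hedged sketch of the idea. The tool, its arguments, and the exact JSON schema below are invented for illustration only; the repository's actual formats are defined by the datasets and the Code-to-JSON converter.

```python
import json

# A tool is presented to the LLM as a Python-style docstring (hypothetical tool):
TOOL_DOCSTRING = '''
def works_list(owned_by: list = None):
    """Returns a list of work items owned by the given users."""
'''

# The finetuned model answers a query with Python-style code, e.g.:
model_output = 'works_list(owned_by=["DEVU-123"])'

def to_json(call: str) -> str:
    """Deterministically convert a single-argument Python-style tool call
    to a JSON tool-call string (simplified; handles one keyword argument)."""
    name, _, rest = call.partition("(")
    args = {}
    rest = rest.rstrip(")")
    if rest:
        key, _, val = rest.partition("=")
        args[key.strip()] = json.loads(val)
    return json.dumps([{"tool_name": name.strip(), "arguments": args}])

print(to_json(model_output))
```

Because the final conversion step is deterministic code rather than another LLM call, the JSON output format is guaranteed to be well-formed.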
The application has been deployed for convenience and can be accessed here.
**Note:** Ensure the following dependencies are installed on your system:
- Clone the Repository

  ```shell
  git clone https://github.com/devrev-high/Final_Code
  cd ./website/frontend
  ```

- Install Frontend Dependencies

  ```shell
  npm install
  ```
- Configure Environment

  - Create a `.env` file based on `.env.example`.
- Start the Server

  ```shell
  npm run dev
  ```
- Navigate to the backend directory

  ```shell
  cd ./website/backend
  ```

- Build the Docker Image

  ```shell
  docker-compose build
  ```

- Run the Docker Image

  ```shell
  docker-compose up
  ```
If you are unable to use the deployed website or run it locally, an interactive playground notebook can also be accessed here.
Note: This supports only limited functionality and is meant for backup purposes only.
Change directory to RTaC by running the following command:

```shell
cd ./RTaC
```

Run the following command to install all dependencies automatically:

```shell
conda env create -f environment_droplet.yml
```

The dataset can be generated from scratch by running the `dataset_main` notebook in the `notebooks` folder. The notebook creates a `generated` directory within the `datasets` folder and saves the generated datasets there in the required format. The exact method and prompt formations are outlined in detail in the notebook itself.
For our experiments, all datasets were reviewed by a human, with corrections applied wherever the generation stage introduced errors. We provide these pre-generated datasets in the `datasets/pre-generated` folder and strongly encourage training models on them.
To maintain credibility and verifiability, we generate data for three different scenarios mentioned in the report:
- Evaluating Few-Shot prompting of CodeLLMs (referred to as P1) (check section 4.2.1 of the report)
- Training and Evaluating CodeLLMs for the tool-memorisation methodology (referred to as P2) (check section 4.2.2 of the report)
- Training and Evaluating RTaC (our proposed final pipeline) (referred to as P3) (check section 4.2.3 of the report)
We adopt the Self-Instruct methodology to generate our datasets: GPT-4 generates queries and outputs based on the tool list passed to it in the prompt. Further, we split query generation and output generation between two distinct LLM agents to mitigate the vulnerability of LLMs to hallucination.
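The two-agent split can be sketched as follows. The agent functions below are stand-ins for the actual GPT-4 prompts, which are detailed in the dataset notebook; the tool string and stub responses are invented for illustration.

```python
def generate_example(tool_docstrings, query_agent, output_agent):
    """One Self-Instruct step with two agents: the first invents a user query
    for the given tools, the second writes the expected output for that query.
    Splitting the roles keeps each prompt focused and reduces hallucination."""
    query = query_agent(tool_docstrings)           # agent 1: query generation
    output = output_agent(tool_docstrings, query)  # agent 2: output generation
    return {"query": query, "output": output}

# Usage with stub agents (a real run would call GPT-4 instead):
example = generate_example(
    "def works_list(...): ...",
    query_agent=lambda tools: "List my open work items",
    output_agent=lambda tools, q: 'works_list(owned_by=["self"])',
)
```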
Our best RTaC models can be downloaded locally using the following command:
```shell
gdown https://drive.google.com/drive/folders/1lpJCVKcnz93K_dvhZa51hVijid-IuwNr?usp=sharing --folder --remaining-ok
```

Any open-sourced LLM can be fine-tuned using the `fine_Tuning.py` script provided in the `src` directory. The script can train on both locally stored datasets and open-sourced datasets hosted on Hugging Face. It covers both training scenarios, P2 and P3, simply by changing the dataset the model is trained on.
Below is a template command to initiate fine-tuning using the `fine_Tuning.py` script:

```shell
python executables/fine_Tuning.py --pipeline <pipeline_version> --repo_dir <finetuning_mode> --dataset_1 <stage_1_dataset_name> --dataset_2 <stage_2_dataset_name> --base_model <model_name> --n_1 <num_epochs_stage_1> --n_2 <num_epochs_stage_2> --lora_alpha_1 <lora_alpha_value_stage_1> --lora_alpha_2 <lora_alpha_value_stage_2> --lora_dropout_1 <lora_dropout_value_stage_1> --lora_dropout_2 <lora_dropout_value_stage_2> --lora_r_1 <lora_r_value_stage_1> --lora_r_2 <lora_r_value_stage_2> --learning_rate_1 <learning_rate_stage_1> --learning_rate_2 <learning_rate_stage_2>
```

To train under the P2 (tool-memorisation) scenario, run the following command:
```shell
python executables/fine_Tuning.py --pipeline 2 --repo_dir 2 --dataset_1 datasets/Pre-Generated/P2_datasets/train_val --base_model codellama/CodeLlama-7b-Instruct-hf
```

To train under the P3 (RTaC) scenario, run the following command:
```shell
python executables/fine_Tuning.py --pipeline 3 --repo_dir 2 --dataset_1 datasets/Pre-Generated/P3_datasets/train_val/Stage-1 --dataset_2 datasets/Pre-Generated/P3_datasets/train_val/Stage-2 --base_model codellama/CodeLlama-7b-Instruct-hf
```
For a detailed explanation of each argument, refer to the table below:
| Option | Description | Type | Default |
|---|---|---|---|
| `--pipeline` | Enter 2 for P2, 3 for P3 | int | 3 |
| `--repo_dir` | Enter 1 for Hugging Face repo, 2 for local directory | int | 2 |
| `--dataset_1` | Name of stage 1 dataset to finetune on | str | "datasets/Pre-Generated/P3_datasets/train_val/Stage-1" |
| `--dataset_2` | Name of stage 2 dataset to finetune on | str | "datasets/Pre-Generated/P3_datasets/train_val/Stage-2" |
| `--base_model` | Name of base model to finetune | str | "RtaC-Models/codellama/CodeLlama-7b-Instruct-hf" |
| `--n_1` | Number of stage 1 epochs | int | 5 |
| `--n_2` | Number of stage 2 epochs | int | 5 |
| `--lora_alpha_1` | Alpha parameter for stage 1 LoRA | int | 16 |
| `--lora_alpha_2` | Alpha parameter for stage 2 LoRA | int | 16 |
| `--lora_dropout_1` | Dropout for stage 1 LoRA | float | 0.1 |
| `--lora_dropout_2` | Dropout for stage 2 LoRA | float | 0.1 |
| `--lora_r_1` | r parameter for stage 1 LoRA | int | 8 |
| `--lora_r_2` | r parameter for stage 2 LoRA | int | 8 |
| `--learning_rate_1` | Learning rate for stage 1 | float | 2e-4 |
| `--learning_rate_2` | Learning rate for stage 2 | float | 2e-4 |
The model outputs are generated in a Python-inspired format. A Code-to-JSON converter was built to convert the Python outputs to the desired JSON format. The working of the converter is briefly explained below:

- To convert the model's generated code to the required JSON format, we use a Python script. This file is modelled as a compiler-type script and is a key component of our pipeline.
- Each line is individually classified as either a bonus (if/for) case or a general case. The bonus cases go through their respective handlers and are then treated like the general case.
- The typical flow of any case involves the following calls: `process_toolcalls`, then `make_tool` for each valid `tool_name`, `make_toolcall`, and `update_arg_val` for each valid argument name.
For a more detailed explanation of how the converter works, please refer to section A.1 of the report.
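As a heavily simplified illustration of that flow, the sketch below mirrors the helper names listed above, but their bodies are placeholders: the real script performs tool-list validation, full argument parsing, and dedicated bonus-case handling.

```python
def make_tool(name):
    # In the real converter this validates `name` against the tool list.
    return {"tool_name": name, "arguments": []}

def update_arg_val(tool, key, val):
    # Record one parsed argument name/value pair on the current tool call.
    tool["arguments"].append({"argument_name": key, "argument_value": val})

def process_toolcalls(code_lines):
    """Convert Python-style tool-call lines into a list of JSON-ready dicts."""
    toolcalls = []
    for line in code_lines:
        line = line.strip()
        if line.startswith(("if ", "for ")):
            # Bonus case: a dedicated if/for handler unwraps the construct,
            # then the inner call is treated like a general case (omitted here).
            continue
        name, _, rest = line.partition("(")
        # Handle assignments like `res = tool(...)` by keeping the call name.
        tool = make_tool(name.split("=")[-1].strip())
        for pair in rest.rstrip(")").split(","):  # naive: no nested commas
            if "=" in pair:
                key, _, val = pair.partition("=")
                update_arg_val(tool, key.strip(), val.strip())
        toolcalls.append(tool)
    return toolcalls
```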
Inference and Evaluation can be carried out by running the inference_main notebook in the notebooks folder. This notebook creates an output directory and stores the generated outputs as CSV files.
The inference is independently conducted for all three scenarios: P1, P2, and P3. Each scenario is evaluated on three types of test datasets: Static, Dynamic, and Bonus. Our trained models have been uploaded to Hugging Face, and the inference and evaluation notebook loads them directly from there.
- For each scenario (P1, P2, and P3), run the respective inference code blocks for Static, Dynamic, and Bonus datasets.
- The outputs are saved in separate CSV files for each dataset type.
This section evaluates the performance of each pipeline on the three test datasets. Evaluation is done using the following metrics:
- Precision
- Recall
- F1 Score
- LangChain Metric
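For the first three metrics, here is a hedged sketch of how they can be computed over predicted versus gold tool calls; the actual evaluation code lives in the notebook, and the LangChain metric is computed separately.

```python
def precision_recall_f1(predicted, gold):
    """Set-level precision, recall, and F1 between predicted and gold tool calls."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: calls present in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one correct call, one spurious call, one missed call.
p, r, f = precision_recall_f1({"works_list", "summarize"}, {"works_list", "search"})
```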
- For each scenario (P1, P2, and P3), run the respective evaluation code blocks for Static, Dynamic, and Bonus datasets.
- The evaluation scores are printed as output in the notebook itself.
A sample result from our original set of experiments is shown below:

*(sample result figures)*