
DevRev High Prep: Reimagining Tooling as Coding (RTaC)

This repository contains our solution for the ‘AI Agent 007: Tooling up for success’ problem statement, provided by DevRev as a part of Inter-IIT Tech 12.0.


Figure 1: Our proposed pipeline - RTaC


We present RTaC, which reconceptualizes tooling as a coding task to exploit the powerful code-comprehension capabilities of LLMs. RTaC provides the available tools in docstring format to instruction-finetuned coding-base LLMs, extracts the output as Python code, and then deterministically converts it to JSON. Because RTaC promotes docstring-reading capability in the LLMs, it supports tool modification, addition, and deletion. Using RTaC, we match GPT-4 benchmark performance while employing just the DeepSeek 1.3B and CodeLlama 7B LLMs, a reduction in parameter count of over 300 times. We also cut per-query cost by over five times while matching GPT-4's latency. Moreover, RTaC supports the processing of complex conditional and iterative logic (Bonus), surpassing GPT-4's capabilities.
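To make the three formats concrete, the sketch below shows a hypothetical tool in docstring format, a Python-style model output, and the equivalent JSON the converter would produce. The tool and argument names (`works_list`, `applies_to_part`) are invented for illustration and are not taken from the actual tool list.

```python
# Hypothetical illustration of the RTaC formats: a tool is shown to the
# model as a docstring, the model replies in Python-style code, and the
# pipeline deterministically converts that code to JSON.
import json

# 1. Tool presented to the LLM in docstring format (hypothetical tool).
tool_docstring = '''
def works_list(applies_to_part: str) -> list:
    """Returns a list of work items for the given part."""
'''

# 2. Python-style output extracted from the model.
model_output = 'var_1 = works_list(applies_to_part="FEAT-123")'

# 3. Equivalent JSON produced by the deterministic converter.
target_json = [
    {
        "tool_name": "works_list",
        "arguments": [
            {"argument_name": "applies_to_part", "argument_value": "FEAT-123"}
        ],
    }
]
print(json.dumps(target_json, indent=2))
```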

The application has been deployed for convenience and can be accessed here.

Table of Contents

Getting Started - Running Application Locally

Pre-requisites

Note: Ensure the following dependencies are installed on your system:

Docker and Docker Compose

Python 3 or later

Node Package Manager (npm)

Quick start

Using Git

  1. Clone the Repository

    git clone https://github.com/devrev-high/Final_Code
    cd Final_Code/website/frontend
  2. Install Frontend Dependencies

    npm install
  3. Configure Environment

    • Create a .env file based on .env.example.
  4. Start the Server

    npm run dev

Using Docker

  1. Navigate to backend directory
    cd ./website/backend
  2. Build the Docker Image
    docker-compose build
  3. Run the Docker Image
    docker-compose up

If you are unable to access the deployed website or run it locally, an interactive playground notebook can also be accessed here.

Note: The notebook supports only limited functionality and is meant for backup purposes only.


Build from Scratch

Change directory to RTaC by running the following command:

cd ./RTaC

Dependencies

Run the below command to install all dependencies automatically:

conda env create -f environment_droplet.yml

Dataset Generation

The dataset can be generated from scratch by running the dataset_main notebook in the notebooks folder. The notebook creates a generated directory within the datasets folder and saves the generated datasets in the required format in that directory. The exact method and prompt formations have been outlined in detail in the notebook itself.

For our experiments, all our datasets were manually reviewed, with corrections applied where the generation stage introduced errors. We have provided the pre-generated datasets in the datasets/pre-generated folder. We strongly encourage training models on the pre-generated datasets.

Data Generation Process:

To maintain credibility and verifiability, we generate data for three different scenarios mentioned in the report:

  1. Evaluating Few-Shot prompting of CodeLLMs (referred to as P1) (check section 4.2.1 of the report)
  2. Training and Evaluating CodeLLMs for the tool-memorisation methodology (referred to as P2) (check section 4.2.2 of the report)
  3. Training and Evaluating RTaC (our proposed final pipeline) (referred to as P3) (check section 4.2.3 of the report)

We adopt the Self-instruct methodology to generate our datasets, which utilizes GPT-4 to generate queries and outputs, encompassing the tool list passed to it in the prompt. Further, we split the task of query and output generation between two distinct LLM agents to tackle the vulnerability of LLMs to hallucinations.
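The two-agent split above can be sketched as follows. Both "agents" are stubbed out here; in practice each would be a separate GPT-4 prompt seeded with the tool list. All function and tool names in this sketch are hypothetical, not the actual prompts used.

```python
# Sketch of the two-agent Self-Instruct split: one agent generates user
# queries from the tool list, and a second agent independently generates
# the Python-style outputs for those queries, reducing the chance that a
# single hallucinating model corrupts both sides of a training pair.

def query_agent(tool_list):
    # Stub: would prompt GPT-4 with the tool docstrings to invent queries.
    return [f"Find items handled by {tool}" for tool in tool_list]

def output_agent(query, tool_list):
    # Stub: would prompt GPT-4 to answer with a Python-style tool call.
    return f'{tool_list[0]}(query="{query}")'

tools = ["works_list", "summarize_objects"]  # hypothetical tool names
dataset = [(q, output_agent(q, tools)) for q in query_agent(tools)]
for query, output in dataset:
    print(query, "->", output)
```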

Downloading Models

Our best RTaC models can be downloaded locally using the following command:

gdown https://drive.google.com/drive/folders/1lpJCVKcnz93K_dvhZa51hVijid-IuwNr?usp=sharing --folder --remaining-ok

Fine Tuning

Description

Any open-source LLM can be fine-tuned using the fine_Tuning.py script provided in the src directory. The script can train on both locally stored datasets and open-source datasets hosted on Hugging Face. It covers both training scenarios, P2 and P3, simply by changing the dataset the model is trained on.

Usage:

Below is a template command to initiate fine-tuning using the fine_Tuning.py script.

python executables/fine_Tuning.py --pipeline <pipeline_version> --repo_dir <finetuning_mode> \
    --dataset_1 <stage_1_dataset_name> --dataset_2 <stage_2_dataset_name> \
    --base_model <model_name> --n_1 <num_epochs_stage_1> --n_2 <num_epochs_stage_2> \
    --lora_alpha_1 <lora_alpha_value_stage_1> --lora_alpha_2 <lora_alpha_value_stage_2> \
    --lora_dropout_1 <lora_dropout_value_stage_1> --lora_dropout_2 <lora_dropout_value_stage_2> \
    --lora_r_1 <lora_r_value_stage_1> --lora_r_2 <lora_r_value_stage_2> \
    --learning_rate_1 <learning_rate_stage_1> --learning_rate_2 <learning_rate_stage_2>

To train under the P2 (tool-memorization) scenario, run the following command:

python executables/fine_Tuning.py --pipeline 2 --repo_dir 2 --dataset_1 datasets/Pre-Generated/P2_datasets/train_val --base_model codellama/CodeLlama-7b-Instruct-hf

To train under the P3 (RTaC) scenario, run the following command:

python executables/fine_Tuning.py --pipeline 3 --repo_dir 2 --dataset_1 datasets/Pre-Generated/P3_datasets/train_val/Stage-1 --dataset_2 datasets/Pre-Generated/P3_datasets/train_val/Stage-2 --base_model codellama/CodeLlama-7b-Instruct-hf

For a detailed explanation of each argument, refer to the table below:

| Option | Description | Type | Default |
|--------|-------------|------|---------|
| `--pipeline` | `2` for P2, `3` for P3 | int | `3` |
| `--repo_dir` | `1` for Hugging Face repo, `2` for local directory | int | `2` |
| `--dataset_1` | Name of the stage-1 dataset to finetune on | str | `"datasets/Pre-Generated/P3_datasets/train_val/Stage-1"` |
| `--dataset_2` | Name of the stage-2 dataset to finetune on | str | `"datasets/Pre-Generated/P3_datasets/train_val/Stage-2"` |
| `--base_model` | Name of the base model to finetune | str | `"RtaC-Models/codellama/CodeLlama-7b-Instruct-hf"` |
| `--n_1` | Number of stage-1 epochs | int | `5` |
| `--n_2` | Number of stage-2 epochs | int | `5` |
| `--lora_alpha_1` | Alpha parameter for stage-1 LoRA | int | `16` |
| `--lora_alpha_2` | Alpha parameter for stage-2 LoRA | int | `16` |
| `--lora_dropout_1` | Dropout parameter for stage-1 LoRA | float | `0.1` |
| `--lora_dropout_2` | Dropout parameter for stage-2 LoRA | float | `0.1` |
| `--lora_r_1` | R parameter for stage-1 LoRA | int | `8` |
| `--lora_r_2` | R parameter for stage-2 LoRA | int | `8` |
| `--learning_rate_1` | Learning rate for stage 1 | float | `2e-4` |
| `--learning_rate_2` | Learning rate for stage 2 | float | `2e-4` |
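The documented options and defaults can be reconstructed as a minimal `argparse` sketch; the real fine_Tuning.py may differ in detail, so treat this only as a reference for the interface described above.

```python
# Minimal argparse sketch of the fine-tuning CLI, built from the
# documented options and defaults (not the actual fine_Tuning.py).
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="RTaC fine-tuning (sketch)")
    p.add_argument("--pipeline", type=int, default=3, help="2 for P2, 3 for P3")
    p.add_argument("--repo_dir", type=int, default=2, help="1 for HF repo, 2 for local dir")
    p.add_argument("--dataset_1", type=str,
                   default="datasets/Pre-Generated/P3_datasets/train_val/Stage-1")
    p.add_argument("--dataset_2", type=str,
                   default="datasets/Pre-Generated/P3_datasets/train_val/Stage-2")
    p.add_argument("--base_model", type=str,
                   default="RtaC-Models/codellama/CodeLlama-7b-Instruct-hf")
    # Stage-specific LoRA and optimiser hyperparameters share defaults.
    for stage in (1, 2):
        p.add_argument(f"--n_{stage}", type=int, default=5,
                       help=f"number of stage-{stage} epochs")
        p.add_argument(f"--lora_alpha_{stage}", type=int, default=16)
        p.add_argument(f"--lora_dropout_{stage}", type=float, default=0.1)
        p.add_argument(f"--lora_r_{stage}", type=int, default=8)
        p.add_argument(f"--learning_rate_{stage}", type=float, default=2e-4)
    return p

args = build_parser().parse_args([])  # defaults only
print(args.pipeline, args.lora_r_1, args.learning_rate_2)
```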

JSON Converter

Figure 2: Code-to-JSON Converter


The model outputs are generated in a Python-inspired format. A Code-to-JSON converter was built to convert the Python outputs to the desirable JSON format. The working of the converter has been briefly explained below:
  1. To convert the model’s generated code to the required JSON format, we use a Python script. This file is modelled as a compiler-type script and is a key component of our pipeline.

  2. Each line is individually classified into either a bonus (if/for) case or a general case. The bonus cases go through their respective handlers and are then treated like the general case.

  3. The typical flow of any case involves the following calls: process_tool calls make_tool for each valid tool_name, and make_tool calls update_arg_val for each valid argument name.

For a more detailed explanation of how the converter works, please refer to section A.1 of the report.
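The process_tool → make_tool → update_arg_val call flow for the general case can be sketched as below. This is a simplified reconstruction using the function names above; the actual script also handles the if/for bonus cases and considerably more syntax.

```python
# Simplified sketch of the converter's general-case call flow:
# process_tool -> make_tool -> update_arg_val.
import json
import re

def update_arg_val(name: str, value: str) -> dict:
    # Normalises one argument into the JSON argument schema.
    return {"argument_name": name, "argument_value": value.strip().strip('"')}

def make_tool(tool_name: str, arg_str: str) -> dict:
    # Builds one tool-call object, delegating each argument.
    args = []
    for pair in filter(None, (p.strip() for p in arg_str.split(","))):
        name, _, value = pair.partition("=")
        args.append(update_arg_val(name.strip(), value))
    return {"tool_name": tool_name, "arguments": args}

def process_tool(code_line: str) -> dict:
    # General (non-bonus) case: one Python-style call per line.
    match = re.match(r"(\w+)\((.*)\)", code_line.strip())
    tool_name, arg_str = match.groups()
    return make_tool(tool_name, arg_str)

print(json.dumps(process_tool('works_list(applies_to_part="FEAT-123")')))
```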

Inference and Evaluation

Inference and Evaluation can be carried out by running the inference_main notebook in the notebooks folder. This notebook creates an output directory and stores the generated outputs as CSV files.

Inference

Inference is conducted independently for all three scenarios: P1, P2 and P3. Each scenario is evaluated on three types of test datasets: Static, Dynamic and Bonus. Our trained models have been uploaded to Hugging Face, and the inference and evaluation notebook loads them directly from there.

Usage

  • For each scenario (P1, P2, and P3), run the respective inference code blocks for Static, Dynamic, and Bonus datasets.
  • The outputs are saved in separate CSV files for each dataset type.

Evaluation

Description

This section evaluates the performance of each pipeline on the three test datasets, using the evaluation metrics described in the report.

Usage

  • For each scenario (P1, P2, and P3), run the respective evaluation code blocks for Static, Dynamic, and Bonus datasets.
  • The evaluation scores are printed as output in the notebook itself.

A sample result from our original set of experiments is shown below:


Figure 3: Evaluation Result