GitHub - regularpooria/BLADE: BLADE - Bug Localization Assisted Debugging Engine

BLADE - Bug Localization Assisted Debugging Engine

A research/engineering workspace for experimenting with bug localization on the BugsInPy benchmark using embedding models. It includes utilities to clone/prepare buggy projects, extract error tracebacks, embed text/code with a Hugging Face model, search for likely buggy code chunks, and compute metrics (MAP/MRR, success rates).
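To make the metrics named above concrete, here is a minimal sketch of MRR, MAP, and a top-k success rate. The function names and the input format (one ranked list of chunk IDs plus a set of ground-truth buggy chunk IDs per bug) are assumptions for illustration, not the repository's actual API.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def success_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> bool:
    """True if any relevant item appears in the top k results."""
    return any(item in relevant for item in ranked[:k])


def evaluate(results: List[Tuple[Sequence[Hashable], Set[Hashable]]], k: int = 5) -> Dict[str, float]:
    """Aggregate metrics over bugs; `results` holds one (ranked, relevant) pair per bug."""
    n = len(results)
    if n == 0:
        return {"MRR": 0.0, "MAP": 0.0, "success_rate": 0.0}
    return {
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
        "MAP": sum(average_precision(r, rel) for r, rel in results) / n,
        "success_rate": sum(success_at_k(r, rel, k) for r, rel in results) / n,
    }


# Two toy bugs with hypothetical chunk IDs.
results = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"x", "z"}),
]
scores = evaluate(results, k=1)
print(scores)
```

MRR rewards placing the first correct chunk near the top, while MAP also accounts for bugs that span several chunks.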

Overall Architecture

Here are some figures that illustrate the overall architecture and different steps of the BLADE pipeline:

Full Report (PDF)
Overall View: This figure (Figure 1 in the report) illustrates the entire Automated Program Repair (APR) pipeline, outlining its four main stages: Input, Bug Localization, Program Repair, and Comparison of Approaches.
Embedding Pipeline: This figure details the process of converting code and bug reports into numerical representations (embeddings) for similarity-based retrieval, as discussed in the context of model selection and retrieval methods.
Step 1: Input Processing: This figure (Figure 2 in the report) describes the initial phase of input processing, which includes data preparation, code preparation (cloning, filtering, chunking), and bug preparation (traceback extraction).
Step 2: Bug Localization: This figure (Figure 3 in the report) outlines the bug localization pipeline, showcasing how embedded bug traces are matched against code chunks to identify potential bug locations.
Step 3: Automated Program Repair (APR): This figure represents the final stage of the pipeline, where the localized bug information is used to generate and test potential fixes, aiming to automate the program repair process.
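The matching step described under Step 2 can be sketched as a nearest-neighbour search by cosine similarity. Plain Python lists stand in for model embeddings here, and a simple sort stands in for the FAISS index that scripts/embedding.py reportedly uses; the function names and toy vectors are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, similarity) pairs for the top_k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (real models produce hundreds of dimensions).
traceback_vec = [1.0, 0.0, 1.0]
chunk_vecs = [
    [0.0, 1.0, 0.0],  # chunk 0: orthogonal to the traceback
    [1.0, 0.0, 0.9],  # chunk 1: nearly parallel to the traceback
    [1.0, 1.0, 1.0],  # chunk 2: partially related
]
top = rank_chunks(traceback_vec, chunk_vecs, top_k=2)
for index, score in top:
    print(f"chunk {index}: similarity {score:.3f}")
```

A brute-force scan like this is only practical for small projects; an index such as FAISS serves the same purpose at scale.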

Contents

- Core code:
  - scripts/embedding.py: Embedding utilities powered by transformers/torch and FAISS.
  - scripts/bugsinpy_utils.py: Helpers to clone projects, parse diffs/tracebacks, and run setup/tests.
  - See RUN_ANALYSIS.md for running file/function analysis.
- Data:
  - BugsInPy/: The BugsInPy dataset/projects layout used by the pipeline.
  - dataset/: Generated code_chunks.json and embedding.npy per project/bug.
  - tmp/: Working directory for cloned repos, venvs, and intermediate files.
- Cluster docs: See cluster_how_to/ (overview: cluster_how_to/cluster_readme.md).
- How chunking and embedding work: See CHUNK_EMBEDDING.md (driven by generate_dataset_from_bugsinpy.ipynb).
- How to run analysis (file + function): See RUN_ANALYSIS.md for both modes.
- Generate single LLM patches: See GENERATE_SINGLE_LLM_PATCHES.md to produce candidate function fixes from an LLM.
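As a rough illustration of function-level chunking, the sketch below splits Python source into one chunk per function using the standard ast module. The chunk fields (file, name, start_line, code) are hypothetical and need not match the real schema of code_chunks.json; CHUNK_EMBEDDING.md remains the reference for what the pipeline actually does.

```python
import ast
import json
from typing import Dict, List


def chunk_functions(source: str, filename: str = "<memory>") -> List[Dict[str, object]]:
    """Return one chunk per (possibly nested or async) function definition."""
    tree = ast.parse(source, filename=filename)
    chunks: List[Dict[str, object]] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(
                {
                    "file": filename,            # hypothetical field names
                    "name": node.name,
                    "start_line": node.lineno,
                    # Source text of the function (decorators are not included).
                    "code": ast.get_source_segment(source, node),
                }
            )
    # ast.walk does not guarantee source order, so sort by line number.
    chunks.sort(key=lambda chunk: chunk["start_line"])
    return chunks


example = '''
def add(a, b):
    return a + b


class Stack:
    def push(self, item):
        self.items.append(item)
'''

chunks = chunk_functions(example, "example.py")
print(json.dumps(chunks, indent=2))
```

Each chunk's code field is the text that would then be passed to the embedding model.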

Prerequisites

Python: 3.9 (recommended; aligns with the provided Dockerfile).
OS: Linux/macOS/WSL2. GPU is optional but accelerates embedding.
pip/venv: Make sure you can create virtual environments.

Local setup

Clone the repository

git clone https://github.com/regularpooria/BLADE
cd BLADE

Initialize git submodules (required)
```
git submodule update --init --recursive
```

Create and activate a virtual environment

python3.9 -m venv .venv
source .venv/bin/activate

Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Optional: Configure environment variables by creating a .env file at the repository root:

# Hugging Face model used for embeddings (default is below)
MODEL_NAME=regularpooria/blaze_code_embedding
# Batch size for embedding
BATCH_SIZE=128
# Optional: set Hugging Face cache to avoid re-downloading
# HF_HOME=./models/hf-cache
# TRANSFORMERS_CACHE=./models/hf-cache
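One way to consume these settings is sketched below using only the standard library. The parse_env_text and load_settings helpers are illustrative assumptions (the repository may rely on a package such as python-dotenv instead); the fallback values mirror the example above.

```python
import os
from typing import Dict, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(
    env_text: str = "",
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, object]:
    """Merge .env contents with the process environment (the environment wins)."""
    environ = os.environ if environ is None else environ
    merged = {**parse_env_text(env_text), **environ}
    return {
        "model_name": merged.get("MODEL_NAME", "regularpooria/blaze_code_embedding"),
        "batch_size": int(merged.get("BATCH_SIZE", "128")),
    }


sample_env = """
# Batch size for embedding
BATCH_SIZE=64
# HF_HOME=./models/hf-cache
"""
settings = load_settings(sample_env, environ={})
print(settings)
```

In practice the text would come from reading the .env file at the repository root, and a smaller BATCH_SIZE is a common adjustment when embedding on a CPU or a small GPU.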

Quickstart

Open in Colab: Run analysis quickstart notebook
Open in Colab: Chunk embedding quickstart notebook
See RUN_ANALYSIS.md for step‑by‑step instructions to run file‑level and function‑level analyses (via notebooks and scripts described there).
Run tests:
```
pytest
```

Running on the research clusters

Start here:

cluster_how_to/cluster_readme.md — overview and sequence

Deep dives:

cluster_how_to/login.md — access, SSH keys, Duo
cluster_how_to/folders.md — where to keep code/data
cluster_how_to/clone.md — cloning, venvs, and installing dependencies efficiently
cluster_how_to/limitations.md — important constraints (no Docker in jobs, no Internet, caching)

Key cluster notes:

Jobs have no Internet access; prepare/copy caches (e.g., Hugging Face) on the login node.
Prefer module load for system packages; use pip install --no-index where possible, per the DRA docs.
Use Python virtual environments per project.
For containerized execution on clusters, use Apptainer (see the docs above).
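The note about Internet-free jobs can be sketched as a two-step workflow: download the model on the login node, then point the job at that cache in offline mode. The helper names and the cache path are assumptions. HF_HOME, HF_HUB_OFFLINE, and TRANSFORMERS_OFFLINE are standard Hugging Face environment variables, and snapshot_download comes from the huggingface_hub package, which must be installed for the first step.

```python
import os
from typing import Dict, Mapping, Optional


def prefetch_model(model_name: str, hf_home: str) -> str:
    """Login node only (needs Internet): download a model into <hf_home>/hub."""
    # Imported lazily so the rest of this sketch runs without huggingface_hub.
    from huggingface_hub import snapshot_download

    # <HF_HOME>/hub is where the Hub cache lives by default.
    return snapshot_download(repo_id=model_name, cache_dir=os.path.join(hf_home, "hub"))


def offline_hf_env(hf_home: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build an environment for a compute job that must not touch the network."""
    env = dict(os.environ if base is None else base)
    env["HF_HOME"] = os.path.abspath(hf_home)
    env["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: never call the Hub
    env["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use cached files only
    return env


# Hypothetical cache location; adjust to the cluster's storage layout.
job_env = offline_hf_env("models/hf-cache", base={"PATH": "/usr/bin"})
for key in ("HF_HOME", "HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE"):
    print(f"{key}={job_env[key]}")
```

The resulting mapping can be passed to subprocess.run(..., env=job_env), or the same variables can be exported at the top of a job script.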

License

See LICENSE.

