Retrieval-Augmented Generation (RAG) pipeline for automated bug localization and patch suggestion. It indexes a software repository at the function or line-chunk level, retrieves the most semantically relevant code fragments, and uses a local large language model (LLM) to propose unified diff patches with human-readable rationales.

jsung1997/Self_Adaptive_RAG

Overview: Traditional static analysis tools detect syntax-level issues but often fail to reason about semantic bugs (e.g., missing condition checks, null pointer issues, incorrect logic). This project addresses that gap by combining vector-based retrieval and natural language reasoning to locate likely bug sources and propose contextually relevant patches.

Key Features:

- Semantic Retrieval – Uses transformer-based embeddings (SentenceTransformers) to understand natural-language queries and map them to code semantics.
- Hybrid Ranking – Combines FAISS vector similarity with TF-IDF token matching for better recall on identifiers and keywords.
- Local LLM Reasoning – Employs Ollama to run models such as llama3.1, deepseek-coder:6.7b, or qwen2.5-coder:7b without internet access or API costs.
- Automated Diff Generation – Outputs human-readable git diff patches ready for manual validation.
- Explainable Output – Provides a short “Rationale” explaining why each change is needed.
- Offline & Reproducible – Fully local; no cloud dependencies or privacy risks.
- Modular Design – Easily replaceable embeddings, retrieval engine, or LLM backend.
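The hybrid ranking feature can be sketched as below. The code chunks, the query, and the dense `embed_scores` array are hypothetical stand-ins (the real pipeline obtains dense scores from FAISS over SentenceTransformer embeddings); only the TF-IDF side and the 0.65/0.35 weighting follow the project's description.

```python
# Sketch of hybrid ranking: a weighted ensemble of dense (embedding)
# similarity and sparse TF-IDF similarity over code chunks.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus of code chunks and a natural-language bug query.
chunks = [
    "# load the config file from disk\ndef load_config(path): return json.load(open(path))",
    "def save_settings(path, cfg): json.dump(cfg, open(path, 'w'))",
    "def parse_args(argv): return argv[1:]",
]
query = "crash when loading the config file"

# Sparse side: TF-IDF cosine similarity, good at identifier/keyword recall.
vectorizer = TfidfVectorizer()
chunk_vecs = vectorizer.fit_transform(chunks)
tfidf_scores = cosine_similarity(vectorizer.transform([query]), chunk_vecs)[0]

# Dense side: stand-in values for FAISS cosine similarities (one per chunk).
embed_scores = np.array([0.82, 0.60, 0.15])

# Weighted ensemble, as documented: score = 0.65*embed + 0.35*tfidf.
hybrid = 0.65 * embed_scores + 0.35 * tfidf_scores
ranked = np.argsort(hybrid)[::-1]
print([chunks[i][:40] for i in ranked])
```

The sparse term rewards exact identifier and keyword overlap that embedding similarity can miss, which is why the ensemble improves recall on symbol-heavy queries.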

Technical Components:

| Layer | Description | Library |
| --- | --- | --- |
| Chunking | Line-based overlapping segmentation for contextual integrity. Default: 160 lines, 40 overlap. | Native Python |
| Embeddings | Semantic representation of code using transformer models (all-MiniLM-L6-v2). | SentenceTransformers |
| Vector Index | Approximate nearest neighbor search using cosine similarity. | FAISS |
| Lexical Ranking | Sparse token weighting for identifier-level matching. | scikit-learn TF-IDF |
| Hybrid Scoring | Weighted ensemble: score = 0.65*embed + 0.35*tfidf. | Custom |
| Retriever | Returns top-k chunks with file path, line range, and similarity scores. | Custom |
| LLM Interface | REST API to local Ollama instance. | requests |
| LLM Models | llama3.1, deepseek-coder:6.7b, qwen2.5-coder:7b | Ollama |
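A minimal sketch of the chunking layer, assuming the defaults stated above (160-line windows with a 40-line overlap, i.e. a stride of 120 lines); the function name and return layout are illustrative, not the project's actual API.

```python
# Sketch of line-based overlapping segmentation: fixed-size windows
# with overlap, so context that spans a boundary appears in two
# adjacent chunks. Defaults follow the table: 160 lines, 40 overlap.
def chunk_lines(lines, size=160, overlap=40):
    """Yield (start_line, end_line, text) tuples, 1-indexed inclusive."""
    stride = size - overlap
    chunks = []
    for start in range(0, max(len(lines), 1), stride):
        window = lines[start:start + size]
        if not window:
            break
        chunks.append((start + 1, start + len(window), "\n".join(window)))
        if start + size >= len(lines):
            break  # last window already reached the end of the file
    return chunks

source = [f"line {i}" for i in range(1, 401)]  # a hypothetical 400-line file
for start, end, _ in chunk_lines(source):
    print(start, end)  # → 1 160, 121 280, 241 400
```

The 40-line overlap trades a modest amount of index size for the guarantee that a function straddling a chunk boundary is still retrievable as a whole from at least one chunk.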
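The LLM interface is a plain HTTP call to a local Ollama server. The endpoint and payload shape below (`POST /api/generate` on port 11434 with `model`/`prompt`/`stream` fields) are Ollama's standard REST API; the prompt template and function names are hypothetical stand-ins for the pipeline's real patch-suggestion prompt.

```python
# Sketch of the LLM interface: one POST to a local Ollama instance.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, bug_report, chunks):
    """Assemble a non-streaming generation request (hypothetical prompt)."""
    context = "\n\n".join(chunks)
    prompt = (
        f"Bug report:\n{bug_report}\n\n"
        f"Relevant code:\n{context}\n\n"
        "Propose a unified diff patch and a short rationale."
    )
    return {"model": model, "prompt": prompt, "stream": False}

def suggest_patch(model, bug_report, chunks, timeout=300):
    # Imported here so the payload helper works without the HTTP dependency.
    import requests
    resp = requests.post(OLLAMA_URL, json=build_payload(model, bug_report, chunks),
                         timeout=timeout)
    resp.raise_for_status()
    return resp.json()["response"]  # the model's full completion
```

Because the call is a plain REST request, swapping between llama3.1, deepseek-coder:6.7b, and qwen2.5-coder:7b is just a change of the `model` field.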

Performance:

| Metric | Small Repo (1k chunks) | Medium Repo (10k chunks) |
| --- | --- | --- |
| Indexing Time | ~5s | ~45s |
| Retrieval Latency | < 0.1s | < 0.3s |
| LLM Generation | 3–10s | Depends on model |
| Index Size | ~2 MB | ~15 MB |

Research Relevance: This project lies at the intersection of:

- RAG-based code understanding
- LLM-assisted software maintenance
- Self-adapting machine learning (SHML)
- AI-driven software defect prediction

Potential applications include:

- Automatic bug triage in large-scale repositories
- Context-aware patch suggestion systems
- Foundation for “self-healing software” pipelines where the model adapts to recurring failure patterns
