A highly token-efficient code-aware RAG system specialized for repository understanding.
~80% token reduction vs naive RAG on large codebases, while maintaining comparable accuracy.
Highlights • Performance • Use Cases • Deep Dive • Architecture • Quick Start • Results • Modules • Metrics
- Specialized Code Chunking: AST-aware file/class/function/block chunking with structured data extraction
- LLM Summary Generation: Auto-generated chunk summaries during indexing for better retrieval quality
- Multi-stage Retrieval Pipeline: Query expansion + vector search + metadata filtering + reranking
- Chinese Optimization: 2-gram/3-gram keyword matching with an exclusion table for semantically weak pronouns
- Token Efficiency: up to ~88% token reduction vs non-optimized RAG on large repos (14100 → 1634 tokens)
- Dual Model Strategy: Fast model for simple questions, strong model for complex questions, ensuring accuracy while optimizing cost and latency
- Extensible Architecture: Vector storage abstraction layer for future migration
- FastAPI & MCP Support: Production-ready API and Model Context Protocol for easy integration
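As an illustration of the dual-model idea, here is a minimal routing sketch. The heuristic classifier is a toy stand-in for the fast-LLM classifier, and both function names are hypothetical, not RepoMind's actual API:

```python
def classify_question(question: str) -> str:
    """Toy heuristic stand-in for the fast-LLM question classifier."""
    complex_markers = ("why", "how", "compare", "architecture", "trade-off")
    if len(question) > 120 or any(m in question.lower() for m in complex_markers):
        return "complex"
    return "simple"

def route_model(question: str, classify=classify_question) -> str:
    """Send simple questions to the fast model, complex ones to the strong model."""
    return "qwen-flash" if classify(question) == "simple" else "qwen3.5-plus"
```

In the real system the classification step is itself delegated to the fast LLM, so routing adds only a cheap call before answer generation.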
| Approach | Recall | Cost |
|---|---|---|
| Naive RAG | High | Very High (full files) |
| RepoMind | Comparable | ~80% lower (summaries + structured data) |
- Small repos: Comparable or slightly better accuracy than naive RAG
- Large repos: ~5-10% lower accuracy in single-query setting, but massive token savings
- Token reduction: ~88% on medium-large projects (14100 → 1634 tokens), ~21% on small projects (3163 → 2502 tokens)
See full baseline results below for detailed metrics.
- AI Agent Context Provider: Integrate with Claude Desktop or other AI tools via MCP to provide codebase context with minimal token overhead
- Large Repo Exploration: Efficiently navigate and understand internal tools or niche open-source projects without sending entire files to LLMs
- Team Knowledge Base: Help new team members onboard faster by answering codebase questions with grounded, verifiable answers
Challenge: Balancing granularity and context for optimal retrieval
Solution:
- File-level: Whole module overview with imports and top-level structure
- Class-level: Class responsibilities and methods
- Function-level: Function inputs, outputs, and call relationships
- Block-level: Code blocks in script files
Trade-offs: Finer granularity improves precision but may lose context; solved with low-cost fast LLM-generated summaries that preserve context while keeping individual chunks focused.
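A minimal sketch of AST-aware chunking using Python's built-in `ast` module. The function name and chunk fields are illustrative, not RepoMind's actual interface:

```python
import ast

def chunk_source(source: str, path: str) -> list[dict]:
    """Split a Python file into file/class/function-level chunks with structured data."""
    tree = ast.parse(source)
    # File-level chunk: imports and top-level structure
    imports = [ast.unparse(n) for n in tree.body
               if isinstance(n, (ast.Import, ast.ImportFrom))]
    chunks = [{"level": "file", "path": path, "imports": imports}]
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Class-level chunk: name and method list
            chunks.append({"level": "class", "name": node.name,
                           "methods": [m.name for m in node.body
                                       if isinstance(m, ast.FunctionDef)]})
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Function-level chunk: signature plus outgoing call names
            calls = sorted({c.func.id for c in ast.walk(node)
                            if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)})
            sig = f"{node.name}({', '.join(a.arg for a in node.args.args)})"
            chunks.append({"level": "function", "name": node.name,
                           "signature": sig, "calls": calls})
    return chunks
```

Each chunk carries structured data (imports, signatures, calls) so the retriever can return metadata instead of raw code when that suffices.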
Challenge: Chinese queries require different handling, and diversity matters in retrieval results
Solution:
- Chinese n-gram Matching: 2-gram + 3-gram for better Chinese keyword matching
- Meaningless Word Filter: Exclusion table for Chinese pronouns ("我", "我们", "你", "你们", etc.)
- Bucket Guarantee: At least 1 document chunk + 1 code chunk to ensure diversity
- MMR Diversity: Maximal Marginal Relevance for result diversity
- Weight Tuning: alpha=0.85 (cosine similarity), beta=0.15 (keyword score) - keywords as "icing on the cake"
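The keyword side of this scoring can be sketched as follows. The stop list is a small illustrative subset, and `hybrid_score` mirrors the alpha/beta weighting but is not the project's actual code:

```python
STOP_PRONOUNS = {"我", "我们", "你", "你们", "他", "它"}  # illustrative exclusion table

def ngrams(text: str, ns=(2, 3)) -> set[str]:
    """Extract 2-grams and 3-grams, dropping any gram containing a stop pronoun."""
    grams = set()
    for n in ns:
        for i in range(len(text) - n + 1):
            g = text[i:i + n]
            if not any(p in g for p in STOP_PRONOUNS):
                grams.add(g)
    return grams

def keyword_score(query: str, chunk_text: str) -> float:
    """Fraction of query n-grams found in the chunk text."""
    q = ngrams(query)
    if not q:
        return 0.0
    return sum(1 for g in q if g in chunk_text) / len(q)

def hybrid_score(cos_sim: float, query: str, chunk_text: str,
                 alpha: float = 0.85, beta: float = 0.15) -> float:
    """Cosine similarity dominates; keywords contribute a small boost."""
    return alpha * cos_sim + beta * keyword_score(query, chunk_text)
```

With beta at 0.15, a perfect keyword match can only shift the final score by a small margin, matching the "icing on the cake" role described above.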
Challenge: Reducing token usage while maintaining answer quality
Solution:
- LLM Summaries: Use low-cost fast LLM (default: qwen-flash) to generate concise summaries instead of sending full code
- Dual Model Strategy: Simple questions use fast model (qwen-flash), complex questions use strong model (qwen3.5-plus), saving cost and optimizing response speed
- Structured Data: Extract imports, signatures, calls instead of using full code
- Smart Context Packing: Prioritize summary > structured data > code
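The packing priority can be sketched as a greedy loop under a token budget. The `len(s) // 4` token estimate and the field names are assumptions for illustration:

```python
def pack_context(chunks: list[dict], budget_tokens: int = 1600,
                 est=lambda s: max(1, len(s) // 4)) -> tuple[str, int]:
    """Greedy packing: all summaries first, then structured data, raw code last."""
    parts, used = [], 0
    for field in ("summary", "structured", "code"):  # priority order
        for ch in chunks:
            piece = ch.get(field)
            if not piece:
                continue
            cost = est(piece)
            if used + cost > budget_tokens:
                continue  # skip pieces that would blow the budget
            parts.append(piece)
            used += cost
    return "\n\n".join(parts), used
```

Because summaries are visited first, raw code only enters the prompt when budget remains after the cheaper representations, which is where most of the token savings come from.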
```mermaid
graph TD
    A[Query Input] --> B[Query Expansion<br/>MQE]
    B --> C[Query Classification<br/>Simple/Complex]
    C --> D[Multi-stage Retrieval Pipeline]
    subgraph D_Pipeline[Multi-stage Retrieval Pipeline]
        D1[1. Vector Retrieval<br/>FAISS Top 20]
        D2[2. Bucket Guarantee<br/>Docs + Code]
        D3[3. Keyword Scoring<br/>Chinese n-gram]
        D4[4. MMR Reranking<br/>Diversity]
        D5[5. Final Selection<br/>Top 5]
    end
    D --> D1
    D1 --> D2
    D2 --> D3
    D3 --> D4
    D4 --> D5
    D5 --> E[Context Building]
    subgraph Context[Context Building]
        E1[Chunk Summary<br/>LLM Generated]
        E2[Structured Data<br/>imports, signatures, calls]
        E3[Raw Code<br/>Optional]
    end
    E --> E1
    E --> E2
    E --> E3
    E --> F[Answer Generation<br/>Dual Model Strategy]
    subgraph Gen[Answer Generation]
        F1[Simple Question<br/>qwen-flash]
        F2[Complex Question<br/>qwen3.5-plus]
    end
    C -->|Simple| F1
    C -->|Complex| F2
    F1 --> G[Answer Output]
    F2 --> G
```
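Stage 4 of the pipeline uses the standard Maximal Marginal Relevance formulation. A pure-Python sketch (the λ and k defaults here are illustrative, not the system's tuned values):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, cand_vecs, k: int = 5, lambda_: float = 0.7) -> list[int]:
    """Return indices of k candidates balancing relevance against redundancy."""
    rel = [cosine(query_vec, v) for v in cand_vecs]
    selected: list[int] = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize similarity to anything already selected
            redundancy = max((cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_ * rel[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Lower λ favors diversity: a near-duplicate of an already-selected chunk scores poorly even if it is highly relevant, which is what keeps the final Top 5 from collapsing onto one file.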
- Python 3.9+
- Conda environment:

```bash
cd RepoMind
conda create -n RepoMind python=3.11
conda activate RepoMind
pip install -r requirements.txt
```

Copy `.env.example` to `.env` and configure:

```bash
cp .env.example .env
# Edit the .env file and set QWEN_API_KEY
```

Use the unified RepoMind class with all configurable options:
```python
from repomind import RepoMind

# Initialize with default configuration
repomind = RepoMind()

# Or with custom configuration
repomind = RepoMind(
    enable_query_expansion=True,        # Enable query expansion (MQE)
    enable_query_classification=True,   # Enable question classification
    query_expansion_variants=2,         # Number of query expansion variants
    use_fast_llm_for_expansion=True,    # Use fast LLM for query expansion
    use_hybrid_answer_generation=True,  # Hybrid answer generation (fast model for simple questions)
)

# Index a repository
repomind.index_repository("/path/to/repo")

# Query
result = repomind.query("What does this project do?")
print(result["answer"])
```

Run the core tests:

```bash
conda activate RepoMind && python scripts/test_core.py
```

Start the API server:

```bash
conda activate RepoMind && uvicorn repomind.api.main:app --reload
```

POST /index

```json
{
    "repo_path": "/path/to/repository"
}
```

POST /query

```json
{
    "question": "What does this project do?"
}
```

Full API documentation: http://localhost:8000/docs
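As a sketch of how a client might call these endpoints, assuming the default uvicorn address (`call` is an illustrative helper, not part of RepoMind):

```python
import json
from urllib import request

BASE = "http://localhost:8000"  # default uvicorn address from the Quick Start

def call(endpoint: str, payload: dict) -> dict:
    """POST a JSON payload to a RepoMind endpoint and return the decoded response."""
    req = request.Request(
        f"{BASE}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# With the server running:
# call("/index", {"repo_path": "/path/to/repository"})
# call("/query", {"question": "What does this project do?"})
```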
RepoMind supports MCP (Model Context Protocol) for integration with Claude Desktop, Claude Code, and other AI tools:
```bash
conda activate RepoMind && python scripts/start_mcp_server.py
```

MCP Tools:
- `index_repository(repo_path)` - Index a code repository
- `query_repository(question)` - Query an indexed repository
- `get_health()` - Check service health
- `save_index(index_path)` - Save index to disk
- `load_index(index_path)` - Load index from disk
Claude Desktop Configuration: Add to Claude Desktop config:
```json
{
    "mcpServers": {
        "repomind": {
            "command": "conda",
            "args": ["run", "-n", "RepoMind", "python", "/path/to/RepoMind/scripts/start_mcp_server.py"]
        }
    }
}
```

See docs/MODULES.md.
For evaluation metrics, see docs/METRICS.md. Tested projects:
- travel_agent (small): LLM-based travel assistant agent (see 测试仓库/)
- cuezero (medium-large): High-performance billiards AI system (https://github.com/sadlavaarsc/CueZero)
| System | Description |
|---|---|
| LLM-only | No retrieval (specific files provided as context, with necessary truncation for large files to save cost) |
| Naive RAG | Non-optimized generic RAG implementation, using file-level chunks to avoid recall degradation from fragmented splitting |
| Structured RAG | Complete ingestion pipeline + naive retrieval + naive rerank |
| Full System | Full optimization (qwen3.5-plus) |
| Full System Fast | Full optimization + dual model strategy (qwen-flash + qwen3.5-plus) |
| System | Avg Recall | Avg Hit Rate | Answerable Rate | E2E Success Rate | Avg Correctness | Avg Grounding | Avg Total Token | Avg Latency(ms) |
|---|---|---|---|---|---|---|---|---|
| llm_only | 0.000 | 0.000 | 0.0% | 40.0% | 2.00 | 0.80 | 3136 | 14463.6 |
| naive_rag | 1.000 | 1.000 | 90.0% | 100.0% | 2.00 | 2.00 | 3163 | 12789.5 |
| structured_rag | 0.975 | 1.000 | 80.0% | 100.0% | 2.00 | 2.00 | 2686 | 13869.1 |
| full_system | 0.975 | 1.000 | 90.0% | 100.0% | 2.00 | 2.00 | 2845 | 37362.6 |
| full_system_fast | 0.975 | 1.000 | 90.0% | 100.0% | 2.00 | 2.00 | 2502 | 15157.2 |
| System | Avg Recall | Avg Hit Rate | Answerable Rate | E2E Success Rate | Avg Correctness | Avg Grounding | Avg Total Token | Avg Latency(ms) |
|---|---|---|---|---|---|---|---|---|
| llm_only | 0.000 | 0.000 | 0.0% | 50.0% | 2.00 | 1.00 | 3590 | 21760.5 |
| naive_rag | 0.500 | 1.000 | 100.0% | 100.0% | 2.00 | 2.00 | 14100 | 15034.3 |
| structured_rag | 0.400 | 0.900 | 70.0% | 70.0% | 1.70 | 2.00 | 3420 | 20691.7 |
| full_system | 0.450 | 1.000 | 100.0% | 80.0% | 1.70 | 2.00 | 2313 | 48915.8 |
| full_system_fast | 0.450 | 1.000 | 100.0% | 90.0% | 1.80 | 2.00 | 1634 | 14342.8 |
Average latency may be inflated by network conditions; these figures are for relative comparison only. Use the llm_only row as a reference point and measure under your own workload for absolute performance.
```
repomind/
├── repomind/
│   ├── ingestion/     # Data parsing and preprocessing
│   ├── indexing/      # Embedding and vector indexing
│   ├── storage/       # Vector storage abstraction
│   ├── retrieval/     # Multi-stage retrieval pipeline
│   ├── generation/    # LLM answer generation
│   ├── evaluation/    # Evaluation metrics
│   ├── api/           # FastAPI service
│   ├── mcp/           # MCP service
│   ├── configs/       # Configuration management
│   ├── baselines/     # Baseline systems
│   └── core.py        # RepoMind core class
├── test_suite/        # Test suite
├── scripts/           # Utility scripts
├── tests/             # Test suite
├── requirements.txt
├── README.md
├── README_zh.md
└── CHANGELOG.md
```
- Vector Storage: FAISS (Facebook AI Similarity Search)
- Embedding Model: text-embedding-v4
- Strong LLM: qwen3.5-plus - for final answer generation
- Fast LLM: qwen-flash - for query expansion, question classification, chunk summary generation, LLM evaluation
- API Framework: FastAPI
- Data Modeling: Pydantic v2
See CHANGELOG.md for a detailed history of changes.
MIT License