Intelligent document parsing and retrieval system optimized for RAG (Retrieval-Augmented Generation) applications
GlutamateIndex is a Python-based framework that parses, analyzes, and retrieves information from varied document types, with strategies specialized for each format. It provides intelligent chunking, semantic search, and LLM-powered analysis for:
- 📚 Research Papers - Academic paper parsing with citation tracking
- 💻 Codebases - Source code analysis with automated documentation generation
- 📄 Documents - PDF, DOCX, PPTX, Markdown, and more
- 🗂️ Structured Data - JSON, XML, CSV processing
- 🔮 Future Support - Legal documents, chat histories, knowledge bases
- Multi-language parsing via Tree-sitter (Python, JavaScript/TypeScript, Java, Go, Rust, C/C++, PHP, Ruby, and more)
- Automated docstring generation using LLMs with intelligent batching
- Code graph construction - Maps relationships between classes, functions, and methods
- Entry point detection - Identifies and documents code execution flows
- Semantic code search with embedding-based retrieval
- Support for NetworkX (in-memory) and Neo4j (persistent) graph stores
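The code graph idea can be pictured with a minimal sketch using plain `(source, relation, target)` triples — the node names and relation labels below are illustrative only, not GlutamateIndex's actual schema (the real NetworkX and Neo4j stores hold richer metadata):

```python
# Illustrative code graph as (source, relation, target) triples.
edges = [
    ("auth/service.py", "contains", "AuthService"),
    ("AuthService", "defines", "AuthService.login"),
    ("AuthService.login", "calls", "hash_password"),
    ("AuthService.login", "calls", "create_session"),
]

def neighbors(node, relation):
    """Follow outgoing edges of a single relation type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# What does AuthService.login call?
print(neighbors("AuthService.login", "calls"))
# What does the file contain?
print(neighbors("auth/service.py", "contains"))
```

Queries like "what calls what" over these triples are what powers dependency-aware retrieval and entry-point flow documentation.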
- PDF parsing with metadata and structure extraction
- LLM-powered Q&A with precise line-based citations
- Context-aware chunking for large documents
- Reference tracking and source attribution
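Context-aware chunking can be sketched as greedy paragraph packing with overlap — split on paragraph boundaries, fill each chunk up to a size budget, and carry trailing paragraphs into the next chunk so context survives the split. This is only the idea; the real strategies in `_chunking.py` are more involved:

```python
def chunk_text(text, max_chars=200, overlap=1):
    """Pack whole paragraphs into chunks of at most ~max_chars,
    carrying `overlap` trailing paragraphs forward for context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # keep tail paragraphs for context
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because each chunk repeats the tail of the previous one, a retrieved chunk still carries the sentences that led into it.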
- OpenAI (GPT-3.5, GPT-4, GPT-4o)
- Anthropic (Claude 3.5 Sonnet, Opus, Haiku)
- Google Gemini (Gemini Pro, Gemini 1.5)
- Groq (Llama 3, Mixtral)
- Ollama (Local models)
- PDF (with image extraction)
- Microsoft Word (DOCX)
- PowerPoint (PPTX)
- Markdown
- JSON, XML, CSV
- Plain text
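Dispatching on document type boils down to an extension-to-parser map. The parser names below are hypothetical stand-ins (see `bummock/parsers/` for the real implementations):

```python
from pathlib import Path

# Hypothetical parsers; the real ones live in bummock/parsers/.
def parse_pdf(path): return f"pdf:{path}"
def parse_docx(path): return f"docx:{path}"
def parse_markdown(path): return f"md:{path}"

PARSERS = {
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".md": parse_markdown,
}

def parse(path: str):
    parser = PARSERS.get(Path(path).suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported format: {path}")
    return parser(path)

print(parse("paper.PDF"))  # → pdf:paper.PDF
```

New formats slot in by registering one more entry in the map, which is how the "Future Support" formats can be added without touching core logic.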
- Graph Stores: NetworkX, Neo4j
- Vector Stores: ChromaDB, Qdrant, Elasticsearch
- Re-rankers: Cohere, Voyage AI
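Under the hood, vector-store retrieval and re-ranking reduce to scoring embeddings against a query vector. A minimal cosine-similarity sketch in plain Python (no vendor SDK — purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, docs, k=2):
    """docs: list of (doc_id, embedding). Returns ids by descending similarity."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

docs = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
print(top_k([1.0, 0.1], docs))  # → ['a', 'b']
```

ChromaDB, Qdrant, and Elasticsearch do this at scale with approximate indexes; re-rankers like Cohere or Voyage AI then re-score the short list with a stronger model.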
- Python 3.12 or higher
- Poetry (recommended) or pip
```bash
# Clone the repository
git clone https://github.com/yourusername/glutamate.git
cd glutamate

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell
```

Or, using pip:

```bash
# Clone the repository
git clone https://github.com/yourusername/glutamate.git
cd glutamate

# Install dependencies
pip install -r requirements.txt
```

For Neo4j graph storage:
```bash
# Ensure Neo4j is running
docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:latest
```

For specific LLM providers, set environment variables:
```bash
export OPENAI_API_KEY="your-api-key"
export ANTHROPIC_API_KEY="your-api-key"
export GROQ_API_KEY="your-api-key"
export GOOGLE_API_KEY="your-api-key"
```
```python
import asyncio

from glutamate.core.codebases.codebase_service import CodebaseService


async def analyze_codebase():
    # Initialize the service
    service = CodebaseService(
        root_dir="/path/to/your/project",
        project_id="my_project",
        llm_provider="openai",  # or "anthropic", "groq", "gemini"
        api_key="your-api-key",
        model="gpt-4o",
    )

    # Index the codebase
    service.index_codebase(path="/path/to/your/project")

    # Generate docstrings with LLM
    docstrings = await service.generate_docstrings(
        repo_id="my_project",
        generate_entry_point_flows=True,
        generate_embeddings=True,
    )
    print(f"Generated {len(docstrings)} docstrings")

    # Search the codebase
    results = await service.query(
        query="How does the authentication system work?",
        top_k=5,
        threshold=0.7,
    )
    print(f"Answer: {results['answer']}")
    for result in results['results']:
        print(f"- {result['name']} (score: {result['score']:.2f})")


# Run the analysis
asyncio.run(analyze_codebase())
```
```python
import asyncio

from glutamate.core.research_papers.ResearchPaper import ResearchPaper


async def analyze_paper():
    # Initialize with LLM provider
    paper = ResearchPaper(
        llm_provider="groq",
        api_key="your-api-key",
        model="llama-3.2-3b-preview",
    )

    # Parse the paper
    data = await paper.parse_paper("path/to/paper.pdf")
    print(f"Parsed {data['content']['total_pages']} pages")

    # Ask questions about the paper
    answer, references = await paper.get_answer_with_context(
        "What is the main contribution of this paper?"
    )
    print(f"Answer: {answer}")
    print(f"References: {references}")


# Run the analysis
asyncio.run(analyze_paper())
```
```python
from glutamate.bummock.graphStore.neo4j_store import Neo4jStore
from glutamate.core.codebases.codebase_service import CodebaseService

# Using Neo4j for persistent storage
graph_store = Neo4jStore(
    uri="bolt://localhost:7687",
    username="neo4j",
    password="password",
)

service = CodebaseService(
    root_dir="/path/to/project",
    project_id="my_project",
    graph_store=graph_store,
    llm_provider="openai",
)
```
```
glutamate/
├── bummock/                     # Reusable building blocks
│   ├── parsers/                 # Document parsers (PDF, DOCX, etc.)
│   ├── llmAPIInterfaces/        # LLM provider abstractions
│   ├── graphStore/              # Graph database implementations
│   ├── vectorStore/             # Vector database implementations
│   └── reRanker/                # Re-ranking strategies
│
├── core/                        # Domain-specific logic
│   ├── codebases/               # Code analysis & indexing
│   │   ├── parser.py                 # Tree-sitter code parsing
│   │   ├── codebase_service.py       # Main service orchestrator
│   │   ├── docstring_generator.py    # LLM-based documentation
│   │   └── embedding_generator.py    # Semantic embeddings
│   │
│   ├── research_papers/         # Academic paper processing
│   │   └── ResearchPaper.py     # Paper parsing & Q&A
│   │
│   └── common/                  # Shared utilities
│       ├── _chunking.py                # Chunking strategies
│       ├── _llmSectioning.py           # Document sectioning
│       └── _llmLongContextRanking.py   # Context ranking
│
└── tests/                       # Test suite
```
Bummock (Building Blocks): Provider-agnostic, reusable components that can work with any LLM, database, or document format.
Core: Domain-specific logic that combines building blocks to solve particular problems (code analysis, paper processing, etc.).
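The bummock/core split can be sketched as a provider-agnostic interface: core logic depends only on a small contract, never on a concrete vendor SDK. This is a simplified illustration with invented names, not the actual `llmAPIInterfaces` API:

```python
from typing import Protocol

class LLMProvider(Protocol):
    """Minimal provider contract: every backend exposes the same call."""
    def complete(self, prompt: str) -> str: ...

class FakeProvider:
    """Stand-in backend; real ones would wrap OpenAI, Anthropic, Groq, etc."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def summarize(provider: LLMProvider, text: str) -> str:
    # Domain logic sees only the protocol, so providers are swappable.
    return provider.complete(f"Summarize: {text}")

print(summarize(FakeProvider(), "GlutamateIndex README"))
```

Swapping providers then means passing a different object, with no change to the domain code.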
- Python
- JavaScript / TypeScript
- Java
- Go
- Rust
- C / C++ / C#
- PHP
- Ruby
- Elixir
- Elm
- OCaml
- PDF (with text and image extraction)
- Microsoft Word (DOCX)
- PowerPoint (PPTX)
- Markdown
- JSON
- XML
- CSV
- Plain Text
| Component | Status | Completion |
|---|---|---|
| Codebase Analysis | 🟢 Active | 70% |
| Research Papers | 🟢 Active | 50% |
| LLM Interfaces | ✅ Stable | 90% |
| Document Parsers | 🟡 Beta | 60% |
| Graph Storage | ✅ Stable | 70% |
| Vector Storage | 🟡 Beta | 40% |
| RAG Pipeline | 🔴 Planning | 20% |
| Testing | 🟡 In Progress | 30% |
Q1 2025
- Complete RAG pipeline integration
- Configuration system for model/strategy selection
- Comprehensive test coverage
- API layer with FastAPI
- Usage documentation and tutorials
Q2 2025
- Web UI for document processing
- Chat history management
- Web search integration
- VSCode extension
- Batch processing support
Future
- Legal document processing
- Multi-modal support (images, audio)
- Obsidian plugin
- Desktop applications (Windows, Mac, Linux)
- Accuracy benchmarks and datasets
```bash
# Run all tests
pytest

# Run specific test suites
pytest tests/test_codebase.py
pytest tests/test_parser.py

# Run with coverage
pytest --cov=glutamate tests/
```

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
```bash
# Install development dependencies
poetry install --with dev

# Format code
black glutamate/

# Type checking
mypy glutamate/

# Run tests
pytest
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Tree-sitter for powerful code parsing
- LangChain community for RAG inspiration
- Open source LLM providers for democratizing AI
- Author: Pramod G
- Email: aicodebox@gmail.com
- Issues: GitHub Issues
⭐ Star this repo if you find it useful! Contributions and feedback are always welcome.