Intelligent document parsing and retrieval system optimized for RAG (Retrieval-Augmented Generation) applications
GlutamateIndex is a Python-based framework that parses, analyzes, and retrieves information from varied document types, with strategies specialized for each format. It provides intelligent chunking, semantic search, and LLM-powered analysis for:
- 📚 Research Papers - Academic paper parsing with citation tracking
- 💻 Codebases - Source code analysis with automated documentation generation
- 📄 Documents - PDF, DOCX, PPTX, Markdown, and more
- 🗂️ Structured Data - JSON, XML, CSV processing
- 🔮 Future Support - Legal documents, chat histories, knowledge bases
- Multi-language parsing via Tree-sitter (Python, JavaScript/TypeScript, Java, Go, Rust, C/C++, PHP, Ruby, and more)
- Automated docstring generation using LLMs with intelligent batching
- Code graph construction - Maps relationships between classes, functions, and methods
- Entry point detection - Identifies and documents code execution flows
- Semantic code search with embedding-based retrieval
- Support for NetworkX (in-memory) and Neo4j (persistent) graph stores
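The code graph idea can be pictured with a minimal sketch using plain `(source, relation, target)` triples — the node names and relation labels below are illustrative only, not GlutamateIndex's actual schema (the real NetworkX and Neo4j stores hold richer metadata):

```python
# Illustrative code graph as (source, relation, target) triples.
edges = [
    ("auth/service.py", "contains", "AuthService"),
    ("AuthService", "defines", "AuthService.login"),
    ("AuthService.login", "calls", "hash_password"),
    ("AuthService.login", "calls", "create_session"),
]

def neighbors(node, relation):
    """Follow outgoing edges of a single relation type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# What does AuthService.login call?
print(neighbors("AuthService.login", "calls"))
# What does the file contain?
print(neighbors("auth/service.py", "contains"))
```

Queries like "what calls what" over these triples are what powers dependency-aware retrieval and entry-point flow documentation.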
- PDF parsing with metadata and structure extraction
- LLM-powered Q&A with precise line-based citations
- Context-aware chunking for large documents
- Reference tracking and source attribution
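Context-aware chunking can be sketched as greedy paragraph packing with overlap — split on paragraph boundaries, fill each chunk up to a size budget, and carry trailing paragraphs into the next chunk so context survives the split. This is only the idea; the real strategies in `_chunking.py` are more involved:

```python
def chunk_text(text, max_chars=200, overlap=1):
    """Pack whole paragraphs into chunks of at most ~max_chars,
    carrying `overlap` trailing paragraphs forward for context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # keep tail paragraphs for context
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because each chunk repeats the tail of the previous one, a retrieved chunk still carries the sentences that led into it.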
- OpenAI (GPT-3.5, GPT-4, GPT-4o)
- Anthropic (Claude 3.5 Sonnet, Opus, Haiku)
- Google Gemini (Gemini Pro, Gemini 1.5)
- Groq (Llama 3, Mixtral)
- Ollama (Local models)
- PDF (with image extraction)
- Microsoft Word (DOCX)
- PowerPoint (PPTX)
- Markdown
- JSON, XML, CSV
- Plain text
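Dispatching on document type boils down to an extension-to-parser map. The parser names below are hypothetical stand-ins (see `bummock/parsers/` for the real implementations):

```python
from pathlib import Path

# Hypothetical parsers; the real ones live in bummock/parsers/.
def parse_pdf(path): return f"pdf:{path}"
def parse_docx(path): return f"docx:{path}"
def parse_markdown(path): return f"md:{path}"

PARSERS = {
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".md": parse_markdown,
}

def parse(path: str):
    parser = PARSERS.get(Path(path).suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported format: {path}")
    return parser(path)

print(parse("paper.PDF"))  # → pdf:paper.PDF
```

New formats slot in by registering one more entry in the map, which is how the "Future Support" formats can be added without touching core logic.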
- Graph Stores: NetworkX, Neo4j
- Vector Stores: ChromaDB, Qdrant, Elasticsearch
- Re-rankers: Cohere, Voyage AI
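Under the hood, vector-store retrieval and re-ranking reduce to scoring embeddings against a query vector. A minimal cosine-similarity sketch in plain Python (no vendor SDK — purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, docs, k=2):
    """docs: list of (doc_id, embedding). Returns ids by descending similarity."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

docs = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
print(top_k([1.0, 0.1], docs))  # → ['a', 'b']
```

ChromaDB, Qdrant, and Elasticsearch do this at scale with approximate indexes; re-rankers like Cohere or Voyage AI then re-score the short list with a stronger model.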
- Python 3.12 or higher
- Poetry (recommended) or pip
```bash
# Clone the repository
git clone https://github.com/yourusername/glutamate.git
cd glutamate

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell
```

Or, using pip:

```bash
# Clone the repository
git clone https://github.com/yourusername/glutamate.git
cd glutamate

# Install dependencies
pip install -r requirements.txt
```

For Neo4j graph storage:
```bash
# Ensure Neo4j is running
docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:latest
```

For specific LLM providers, set environment variables:
```bash
export OPENAI_API_KEY="your-api-key"
export ANTHROPIC_API_KEY="your-api-key"
export GROQ_API_KEY="your-api-key"
export GOOGLE_API_KEY="your-api-key"
```
```python
import asyncio

from glutamate.core.codebases.codebase_service import CodebaseService


async def analyze_codebase():
    # Initialize the service
    service = CodebaseService(
        root_dir="/path/to/your/project",
        project_id="my_project",
        llm_provider="openai",  # or "anthropic", "groq", "gemini"
        api_key="your-api-key",
        model="gpt-4o",
    )

    # Index the codebase
    service.index_codebase(path="/path/to/your/project")

    # Generate docstrings with LLM
    docstrings = await service.generate_docstrings(
        repo_id="my_project",
        generate_entry_point_flows=True,
        generate_embeddings=True,
    )
    print(f"Generated {len(docstrings)} docstrings")

    # Search the codebase
    results = await service.query(
        query="How does the authentication system work?",
        top_k=5,
        threshold=0.7,
    )
    print(f"Answer: {results['answer']}")
    for result in results['results']:
        print(f"- {result['name']} (score: {result['score']:.2f})")


# Run the analysis
asyncio.run(analyze_codebase())
```
```python
import asyncio

from glutamate.core.research_papers.ResearchPaper import ResearchPaper


async def analyze_paper():
    # Initialize with LLM provider
    paper = ResearchPaper(
        llm_provider="groq",
        api_key="your-api-key",
        model="llama-3.2-3b-preview",
    )

    # Parse the paper
    data = await paper.parse_paper("path/to/paper.pdf")
    print(f"Parsed {data['content']['total_pages']} pages")

    # Ask questions about the paper
    answer, references = await paper.get_answer_with_context(
        "What is the main contribution of this paper?"
    )
    print(f"Answer: {answer}")
    print(f"References: {references}")


# Run the analysis
asyncio.run(analyze_paper())
```
```python
from glutamate.bummock.graphStore.neo4j_store import Neo4jStore
from glutamate.core.codebases.codebase_service import CodebaseService

# Using Neo4j for persistent storage
graph_store = Neo4jStore(
    uri="bolt://localhost:7687",
    username="neo4j",
    password="password",
)

service = CodebaseService(
    root_dir="/path/to/project",
    project_id="my_project",
    graph_store=graph_store,
    llm_provider="openai",
)
```
```
glutamate/
├── bummock/                     # Reusable building blocks
│   ├── parsers/                 # Document parsers (PDF, DOCX, etc.)
│   ├── llmAPIInterfaces/        # LLM provider abstractions
│   ├── graphStore/              # Graph database implementations
│   ├── vectorStore/             # Vector database implementations
│   └── reRanker/                # Re-ranking strategies
│
├── core/                        # Domain-specific logic
│   ├── codebases/               # Code analysis & indexing
│   │   ├── parser.py                 # Tree-sitter code parsing
│   │   ├── codebase_service.py       # Main service orchestrator
│   │   ├── docstring_generator.py    # LLM-based documentation
│   │   └── embedding_generator.py    # Semantic embeddings
│   │
│   ├── research_papers/         # Academic paper processing
│   │   └── ResearchPaper.py     # Paper parsing & Q&A
│   │
│   └── common/                  # Shared utilities
│       ├── _chunking.py                # Chunking strategies
│       ├── _llmSectioning.py           # Document sectioning
│       └── _llmLongContextRanking.py   # Context ranking
│
└── tests/                       # Test suite
```
Bummock (Building Blocks): Provider-agnostic, reusable components that can work with any LLM, database, or document format.
Core: Domain-specific logic that combines building blocks to solve particular problems (code analysis, paper processing, etc.).
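The bummock/core split can be sketched as a provider-agnostic interface: core logic depends only on a small contract, never on a concrete vendor SDK. This is a simplified illustration with invented names, not the actual `llmAPIInterfaces` API:

```python
from typing import Protocol

class LLMProvider(Protocol):
    """Minimal provider contract: every backend exposes the same call."""
    def complete(self, prompt: str) -> str: ...

class FakeProvider:
    """Stand-in backend; real ones would wrap OpenAI, Anthropic, Groq, etc."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def summarize(provider: LLMProvider, text: str) -> str:
    # Domain logic sees only the protocol, so providers are swappable.
    return provider.complete(f"Summarize: {text}")

print(summarize(FakeProvider(), "GlutamateIndex README"))
```

Swapping providers then means passing a different object, with no change to the domain code.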
- Python
- JavaScript / TypeScript
- Java
- Go
- Rust
- C / C++ / C#
- PHP
- Ruby
- Elixir
- Elm
- OCaml
- PDF (with text and image extraction)
- Microsoft Word (DOCX)
- PowerPoint (PPTX)
- Markdown
- JSON
- XML
- CSV
- Plain Text
| Component | Status | Completion |
|---|---|---|
| Codebase Analysis | 🟢 Active | 70% |
| Research Papers | 🟢 Active | 50% |
| LLM Interfaces | ✅ Stable | 90% |
| Document Parsers | 🟡 Beta | 60% |
| Graph Storage | ✅ Stable | 70% |
| Vector Storage | 🟡 Beta | 40% |
| RAG Pipeline | 🔴 Planning | 20% |
| Testing | 🟡 In Progress | 30% |
Q1 2025
- Complete RAG pipeline integration
- Configuration system for model/strategy selection
- Comprehensive test coverage
- API layer with FastAPI
- Usage documentation and tutorials
Q2 2025
- Web UI for document processing
- Chat history management
- Web search integration
- VSCode extension
- Batch processing support
Future
- Legal document processing
- Multi-modal support (images, audio)
- Obsidian plugin
- Desktop applications (Windows, Mac, Linux)
- Accuracy benchmarks and datasets
```bash
# Run all tests
pytest

# Run specific test suites
pytest tests/test_codebase.py
pytest tests/test_parser.py

# Run with coverage
pytest --cov=glutamate tests/
```

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
```bash
# Install development dependencies
poetry install --with dev

# Format code
black glutamate/

# Type checking
mypy glutamate/

# Run tests
pytest
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Tree-sitter for powerful code parsing
- LangChain community for RAG inspiration
- Open source LLM providers for democratizing AI
- Author: Pramod G
- Email: aicodebox@gmail.com
- Issues: GitHub Issues
⭐ Star this repo if you find it useful! Contributions and feedback are always welcome.