PGCodehub/GlutamateIndex

🧬 GlutamateIndex

Intelligent document parsing and retrieval system optimized for RAG (Retrieval-Augmented Generation) applications

Python 3.12+ | MIT License | Code style: black

🎯 Overview

GlutamateIndex is a Python-based framework designed to parse, analyze, and retrieve information from various document types with specialized strategies optimized for each data format. It provides intelligent chunking, semantic search, and LLM-powered analysis for:

  • 📚 Research Papers - Academic paper parsing with citation tracking
  • 💻 Codebases - Source code analysis with automated documentation generation
  • 📄 Documents - PDF, DOCX, PPTX, Markdown, and more
  • 🗂️ Structured Data - JSON, XML, CSV processing
  • 🔮 Future Support - Legal documents, chat histories, knowledge bases

✨ Key Features

🔍 Codebase Analysis (Most Advanced)

  • Multi-language parsing via Tree-sitter (Python, JavaScript/TypeScript, Java, Go, Rust, C/C++, PHP, Ruby, and more)
  • Automated docstring generation using LLMs with intelligent batching
  • Code graph construction - Maps relationships between classes, functions, and methods
  • Entry point detection - Identifies and documents code execution flows
  • Semantic code search with embedding-based retrieval
  • Support for NetworkX (in-memory) and Neo4j (persistent) graph stores
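The code-graph idea above can be sketched in a few lines of plain Python. Entity names like `AuthService` are illustrative only, and this dict-based store is a stand-in for the real NetworkX/Neo4j backends, not the GlutamateIndex API:

```python
# Minimal code graph: nodes are code entities, edges are typed relationships.
# (A real store would be NetworkX in-memory or Neo4j for persistence.)
edges = {
    "AuthService": [("defines", "AuthService.login")],
    "AuthService.login": [("calls", "hash_password")],
    "hash_password": [],
}

def reachable(node, graph):
    """Walk outgoing edges to collect everything a node depends on."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for _, target in graph.get(current, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

# Everything reachable from a class approximates its execution flow.
print(sorted(reachable("AuthService", edges)))
```

Entry-point flows fall out of the same traversal: start from a `main` or route-handler node and the reachable set is the code that execution can touch.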

📖 Research Paper Processing

  • PDF parsing with metadata and structure extraction
  • LLM-powered Q&A with precise line-based citations
  • Context-aware chunking for large documents
  • Reference tracking and source attribution
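Line-based citations rest on chunks that remember where they came from. A minimal sketch (the helper name and chunk schema are assumptions, not the actual GlutamateIndex internals):

```python
def chunk_with_lines(text, max_lines=5):
    """Split text into chunks that carry their line range, so an LLM answer
    can cite e.g. 'lines 6-10' instead of an unverifiable paraphrase."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append({
            "start_line": start + 1,            # 1-indexed, inclusive
            "end_line": start + len(block),
            "text": "\n".join(block),
        })
    return chunks

doc = "\n".join(f"sentence {i}" for i in range(1, 13))
for chunk in chunk_with_lines(doc):
    print(chunk["start_line"], chunk["end_line"])
```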

🤖 LLM Provider Support

  • OpenAI (GPT-3.5, GPT-4, GPT-4o)
  • Anthropic (Claude 3.5 Sonnet, Opus, Haiku)
  • Google Gemini (Gemini Pro, Gemini 1.5)
  • Groq (Llama 3, Mixtral)
  • Ollama (Local models)

📝 Document Parsers

  • PDF (with image extraction)
  • Microsoft Word (DOCX)
  • PowerPoint (PPTX)
  • Markdown
  • JSON, XML, CSV
  • Plain text
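Parser selection typically dispatches on file extension. A hypothetical dispatch table for the formats above (the parser names and lookup helper are illustrative, not the `glutamate.bummock.parsers` API):

```python
from pathlib import Path

# Hypothetical mapping: extension -> parser name (actual API may differ).
PARSERS = {
    ".pdf": "PDFParser", ".docx": "DocxParser", ".pptx": "PptxParser",
    ".md": "MarkdownParser", ".json": "JSONParser", ".xml": "XMLParser",
    ".csv": "CSVParser", ".txt": "TextParser",
}

def pick_parser(path):
    """Choose a parser by extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported format: {suffix}")

print(pick_parser("report.PDF"))
```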

🗄️ Storage Backends

  • Graph Stores: NetworkX, Neo4j
  • Vector Stores: ChromaDB, Qdrant, Elasticsearch
  • Re-rankers: Cohere, Voyage AI
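Whatever the backend (ChromaDB, Qdrant, Elasticsearch), embedding-based retrieval reduces to ranking stored vectors by similarity to a query vector. A backend-free sketch with toy 3-dimensional "embeddings" (real embeddings have hundreds of dimensions and come from an embedding model):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

# Toy index: chunk label -> embedding vector.
index = {
    "auth module":    [0.9, 0.1, 0.0],
    "parser module":  [0.1, 0.8, 0.3],
    "cli entrypoint": [0.2, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]
ranked = sorted(index, key=lambda k: cosine(query, index[k]), reverse=True)
print(ranked[0])  # the chunk most similar to the query
```

A re-ranker (Cohere, Voyage AI) then re-scores the top-k of this first pass with a heavier model before the results reach the LLM.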

🚀 Installation

Prerequisites

  • Python 3.12 or higher
  • Poetry (recommended) or pip

Using Poetry (Recommended)

# Clone the repository
git clone https://github.com/yourusername/glutamate.git
cd glutamate

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell

Using pip

# Clone the repository
git clone https://github.com/yourusername/glutamate.git
cd glutamate

# Install dependencies
pip install -r requirements.txt

Optional Dependencies

For Neo4j graph storage:

# Ensure Neo4j is running
docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:latest

For specific LLM providers, set environment variables:

export OPENAI_API_KEY="your-api-key"
export ANTHROPIC_API_KEY="your-api-key"
export GROQ_API_KEY="your-api-key"
export GOOGLE_API_KEY="your-api-key"
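Resolution order is the usual one: an explicitly passed `api_key` wins, otherwise the provider's environment variable is read. A sketch of that lookup (the `resolve_api_key` helper is hypothetical; only the variable names come from the list above):

```python
import os

# Provider -> environment variable, per the list above.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def resolve_api_key(provider, explicit_key=None):
    """Prefer an explicitly passed key, then fall back to the environment."""
    key = explicit_key or os.getenv(ENV_VARS[provider])
    if not key:
        raise RuntimeError(
            f"No API key for {provider}; set {ENV_VARS[provider]} "
            "or pass api_key=..."
        )
    return key
```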

📚 Usage

Codebase Indexing and Analysis

import asyncio
from glutamate.core.codebases.codebase_service import CodebaseService

async def analyze_codebase():
    # Initialize the service
    service = CodebaseService(
        root_dir="/path/to/your/project",
        project_id="my_project",
        llm_provider="openai",  # or "anthropic", "groq", "gemini"
        api_key="your-api-key",
        model="gpt-4o"
    )
    
    # Index the codebase
    service.index_codebase(path="/path/to/your/project")
    
    # Generate docstrings with LLM
    docstrings = await service.generate_docstrings(
        repo_id="my_project",
        generate_entry_point_flows=True,
        generate_embeddings=True
    )
    
    print(f"Generated {len(docstrings)} docstrings")
    
    # Search the codebase
    results = await service.query(
        query="How does the authentication system work?",
        top_k=5,
        threshold=0.7
    )
    
    print(f"Answer: {results['answer']}")
    for result in results['results']:
        print(f"- {result['name']} (score: {result['score']:.2f})")

# Run the analysis
asyncio.run(analyze_codebase())

Research Paper Processing

import asyncio
from glutamate.core.research_papers.ResearchPaper import ResearchPaper

async def analyze_paper():
    # Initialize with LLM provider
    paper = ResearchPaper(
        llm_provider="groq",
        api_key="your-api-key",
        model="llama-3.2-3b-preview"
    )
    
    # Parse the paper
    data = await paper.parse_paper("path/to/paper.pdf")
    print(f"Parsed {data['content']['total_pages']} pages")
    
    # Ask questions about the paper
    answer, references = await paper.get_answer_with_context(
        "What is the main contribution of this paper?"
    )
    
    print(f"Answer: {answer}")
    print(f"References: {references}")

# Run the analysis
asyncio.run(analyze_paper())

Using Different Graph Stores

from glutamate.bummock.graphStore.neo4j_store import Neo4jStore
from glutamate.core.codebases.codebase_service import CodebaseService

# Using Neo4j for persistent storage
graph_store = Neo4jStore(
    uri="bolt://localhost:7687",
    username="neo4j",
    password="password"
)

service = CodebaseService(
    root_dir="/path/to/project",
    project_id="my_project",
    graph_store=graph_store,
    llm_provider="openai"
)

🏗️ Architecture

glutamate/
├── bummock/              # Reusable building blocks
│   ├── parsers/          # Document parsers (PDF, DOCX, etc.)
│   ├── llmAPIInterfaces/ # LLM provider abstractions
│   ├── graphStore/       # Graph database implementations
│   ├── vectorStore/      # Vector database implementations
│   └── reRanker/         # Re-ranking strategies
│
├── core/                 # Domain-specific logic
│   ├── codebases/        # Code analysis & indexing
│   │   ├── parser.py                # Tree-sitter code parsing
│   │   ├── codebase_service.py      # Main service orchestrator
│   │   ├── docstring_generator.py   # LLM-based documentation
│   │   └── embedding_generator.py   # Semantic embeddings
│   │
│   ├── research_papers/  # Academic paper processing
│   │   └── ResearchPaper.py      # Paper parsing & Q&A
│   │
│   └── common/           # Shared utilities
│       ├── _chunking.py           # Chunking strategies
│       ├── _llmSectioning.py      # Document sectioning
│       └── _llmLongContextRanking.py # Context ranking
│
└── tests/                # Test suite

Design Philosophy

Bummock (Building Blocks): Provider-agnostic, reusable components that can work with any LLM, database, or document format.

Core: Domain-specific logic that combines building blocks to solve particular problems (code analysis, paper processing, etc.).

🗂️ Supported Formats

Code Languages (via Tree-sitter)

  • Python
  • JavaScript / TypeScript
  • Java
  • Go
  • Rust
  • C / C++ / C#
  • PHP
  • Ruby
  • Elixir
  • Elm
  • OCaml

Document Formats

  • PDF (with text and image extraction)
  • Microsoft Word (DOCX)
  • PowerPoint (PPTX)
  • Markdown
  • JSON
  • XML
  • CSV
  • Plain Text

🛣️ Roadmap

Current Status: Early Development (~30% Complete)

| Component         | Status         | Completion |
|-------------------|----------------|------------|
| Codebase Analysis | 🟢 Active      | 70%        |
| Research Papers   | 🟢 Active      | 50%        |
| LLM Interfaces    | ✅ Stable      | 90%        |
| Document Parsers  | 🟡 Beta        | 60%        |
| Graph Storage     | ✅ Stable      | 70%        |
| Vector Storage    | 🟡 Beta        | 40%        |
| RAG Pipeline      | 🔴 Planning    | 20%        |
| Testing           | 🟡 In Progress | 30%        |

Upcoming Features

Q1 2025

  • Complete RAG pipeline integration
  • Configuration system for model/strategy selection
  • Comprehensive test coverage
  • API layer with FastAPI
  • Usage documentation and tutorials

Q2 2025

  • Web UI for document processing
  • Chat history management
  • Web search integration
  • VSCode extension
  • Batch processing support

Future

  • Legal document processing
  • Multi-modal support (images, audio)
  • Obsidian plugin
  • Desktop applications (Windows, Mac, Linux)
  • Accuracy benchmarks and datasets

🧪 Testing

# Run all tests
pytest

# Run specific test suites
pytest tests/test_codebase.py
pytest tests/test_parser.py

# Run with coverage
pytest --cov=glutamate tests/

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

# Install development dependencies
poetry install --with dev

# Format code
black glutamate/

# Type checking
mypy glutamate/

# Run tests
pytest

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Tree-sitter for powerful code parsing
  • LangChain community for RAG inspiration
  • Open source LLM providers for democratizing AI

📞 Contact


Star this repo if you find it useful! Contributions and feedback are always welcome.
