A sophisticated Retrieval-Augmented Generation (RAG) system that analyzes GitHub repositories, generates code, and answers questions using Groq LLM, FAISS vector store, and HuggingFace embeddings.
🌐 Try It Here Live
- 📂 Smart Repository Crawling: Automatically processes multiple file types (Python, JavaScript, Java, C++, docs, etc.)
- 🧠 Language-Aware Processing: Uses different text splitters based on file types for optimal chunking
- ⚡ Fast LLM Inference: Powered by Groq for quick and accurate responses
- 🔍 Efficient Vector Search: FAISS for lightning-fast similarity search
- 📝 Rich Source Attribution: Shows exactly which files and code sections were used to answer questions
- 🧩 Repository Analysis: Automated analysis of project structure and components
- 💻 Code Generation: Generate code snippets or functions using Groq LLM prompt templates
├── config.py # Configuration and API keys
├── github_repository.py # GitHub API handling and repository crawling
├── document_processor.py # Document chunking and processing
├── embedding_manager.py # FAISS vector store and embeddings
├── llm_manager.py # Groq LLM integration for GitHub repo
├── code_generate.py # Groq LLM for code generation using prompt templates
├── rag_system.py # Main RAG pipeline orchestration
├── main.py # Streamlit web application
├── requirements.txt # Python dependencies
└── README.md # This file
pip install -r requirements.txtEdit config.py and replace the placeholder values:
@dataclass
class Config:
GROQ_API_KEY: str = "your_actual_groq_api_key_here" # Get from https://console.groq.com/
GITHUB_TOKEN: str = "your_actual_github_token_here" # Optional but recommended
# ... other settingsGetting API Keys:
- 🔑 Groq API Key: Sign up here and generate an API key
- 🐙 GitHub Token: GitHub Settings → Developer Settings → Personal Access Tokens → Generate new token (classic). Only requires public repo read access
streamlit run main.pyThe app will open in your browser at http://localhost:8501
- Enter Repository URL: Paste a GitHub repo URL in the sidebar (e.g.,
https://github.com/langchain-ai/langchain) - Process Repository: Click "Process Repository" to crawl and analyze the repo. (May take a few minutes for large repos)
- Ask Questions: Once processed, ask questions like:
- "How does authentication work?"
- "What are the main components?"
- "How do I set up this project?"
- "Which testing frameworks are used?"
- Generate Code: Use the Code Generation tab to create functions, scripts, or snippets using Groq LLM prompt templates
- View Sources: Each answer shows the specific files and code sections that were used to generate the response
- GitHubRepository: Handles GitHub API interactions, file filtering, and crawling
- AdvancedDocumentProcessor: Applies language-specific text splitting
- EmbeddingManager: Manages HuggingFace embeddings and FAISS vector store
- LLMManager: Interfaces with Groq LLM for responses
- AdvancedRAGSystem: Orchestrates the RAG pipeline
- CodeGenerator: Generates code based on Groq LLM prompt templates
- Repository Crawling: Fetches files with intelligent filtering
- Document Processing: Splits documents using language-aware strategies
- Embedding Creation: Generates embeddings via HuggingFace
- Vector Storage: Stores embeddings in FAISS for similarity search
- Query Processing: Retrieves relevant chunks and generates responses using Groq
- Code Generation: Uses prompt templates to generate code snippets or functions
- Programming Languages: Python, JavaScript/TypeScript, Java, C/C++, C#, PHP, Ruby, Go, Rust, Scala, Kotlin
- Documentation: Markdown, reStructuredText, Plain Text
- Configuration: JSON, YAML, TOML, INI
- Web: HTML, CSS, SCSS
- Special Files: README, LICENSE, CHANGELOG, DOCKERFILE, MAKEFILE
CHUNK_SIZE: Size of text chunks (default: 1000)CHUNK_OVERLAP: Overlap between chunks (default: 200)TOP_K_RETRIEVAL: Number of relevant chunks to retrieve (default: 5)TEMPERATURE: LLM response randomness (default: 0.3)MAX_TOKENS: Maximum response length (default: 2048)
- API Key Errors: Ensure Groq API key is valid in
config.py - Repository Access: Some repos may be private or restricted
- Large Repositories: May take longer and require more memory
- Rate Limits: Using a GitHub token avoids hitting API limits
- Use GitHub tokens for better API rate limits
- Test on smaller repos first
- Groq provides fast inference
- FAISS ensures efficient similarity search
🌐 GitHub Repository
│
▼
📂 Repository Crawling 💻 Code Generation
(GitHub API fetch + file filtering) (Prompt templates with Groq)
│ │
▼ |
Advanced Document Processing │
(Language-aware chunking & preprocessing) │
│ |
▼ ▼
Embedding Creation (HuggingFace) User Input
│ (Get query from user)
▼ |
Vector Storage (FAISS Indexing) |
│ |
| │
▼ |
Query Processing (RAG Q&A) 🛠 Generated Code Snippets
(Retrieve relevant chunks) (Functions, scripts, examples)
│ |
▼ |
LLM Response Generation |
(Groq LLM answers questions) |
│ |
────────────>✅ User Interface (Streamlit)✅<────────────
Legend / Highlights:
- 🌐 Source GitHub repo
- 📂 Crawling & filtering
- 🧠 Language-specific processing
- 🔗 Embeddings creation for semantic search
- 📊 FAISS for fast retrieval
- 📖 Q&A for GitHub knowledge
- 💻 Code Generation using Groq LLM prompt templates
- ✅ Streamlit interface for easy access
- streamlit: Web interface
- langchain: Document processing & splitting
- groq: LLM inference
- sentence-transformers: Embeddings
- faiss-cpu: Vector similarity search
- requests: GitHub API calls
✅ Now, you can explore repositories, ask questions, and even generate code all in one place!