Askly/
├── config/
│   ├── __init__.py
│   └── config.py              # Configuration settings
├── processors/
│   ├── __init__.py
│   ├── pdf_processor.py       # PDF downloading and text extraction
│   └── text_processor.py      # Text processing and chunking
├── models/
│   ├── __init__.py
│   ├── embedding_manager.py   # Embedding creation and management
│   ├── retrieval_system.py    # Semantic search and retrieval
│   └── llm_manager.py         # LLM loading and text generation
├── utils/
│   ├── __init__.py
│   └── utils.py               # Utility functions
├── rag_pipeline.py            # Main pipeline orchestrator
├── data/                      # PDF files and raw data
├── models/                    # Downloaded models
├── outputs/                   # Generated embeddings and outputs
├── main.py                    # Command-line interface
├── run_rag.py                 # Simple runner script
└── README.md                  # This file

- rag_pipeline.py: Main orchestrator that coordinates all components
- main.py: Command-line interface with multiple run modes
- run_rag.py: Simple script for quick testing
- config/config.py: All configuration settings, paths, and parameters
- processors/pdf_processor.py: Downloads PDFs and extracts text
- processors/text_processor.py: Processes text, splits sentences, creates chunks
- models/embedding_manager.py: Creates and manages text embeddings
- models/retrieval_system.py: Performs semantic search and retrieval
- models/llm_manager.py: Loads and manages the language model
- utils/utils.py: Helper functions for text processing, model management, etc.

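
As a rough illustration of how a command-line interface with several run modes might be wired up, here is a minimal argparse sketch. The flag names (--mode, --question, --temperature, --max-tokens) and the three modes come from this README's usage examples; the default values, the helper names build_parser and parse_args, and the rule that single mode requires a question are assumptions, not the actual contents of main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags that appear in this README's examples."""
    parser = argparse.ArgumentParser(description="Askly RAG command-line interface")
    parser.add_argument(
        "--mode",
        choices=["interactive", "demo", "single"],
        default="interactive",  # assumed default, not confirmed by the project
        help="How to run the pipeline",
    )
    parser.add_argument("--question", help="Question to ask in single mode")
    # Both defaults below are placeholders chosen for this sketch.
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--max-tokens", type=int, default=256)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv (or sys.argv when argv is None) and validate the combination."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: single mode needs a question, so fail early with a usage error.
    if args.mode == "single" and not args.question:
        parser.error("--question is required when --mode is single")
    return args
```

Note that argparse exposes --max-tokens as args.max_tokens, since dashes in option names become underscores.
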
# Interactive mode
python main.py --mode interactive
# Demo mode with predefined questions
python main.py --mode demo
# Single question
python main.py --mode single --question "What are macronutrients?"
# Custom settings
python main.py --mode single --question "What is protein?" --temperature 0.5 --max-tokens 512

# Quick start
python run_rag.py

from src.rag_pipeline import RAGPipeline
# Initialize pipeline
pipeline = RAGPipeline()
# Setup (downloads PDF, creates embeddings, loads models)
pipeline.setup_pipeline()
# Ask questions
answer = pipeline.ask("What are the macronutrients?")
print(answer)
# Search without generation
results = pipeline.search("protein sources")

- Modular Design: Each component is separate and can be used independently
- Configuration Management: All settings centralized in config.py
- Error Handling: Comprehensive error handling throughout
- Multiple Interfaces: CLI, programmatic, and interactive modes
- GPU Support: Automatic GPU detection and model optimization
- Caching: Saves embeddings to avoid recomputation
- Extensible: Easy to add new models or processing steps
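
The caching feature can be pictured with a small sketch like the one below. It is not the project's implementation: the function name load_or_compute_embeddings, the JSON file format, the file naming scheme, and the embed_fn callback are all hypothetical, chosen so the example runs with the standard library alone.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


def load_or_compute_embeddings(
    chunks: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    cache_dir: Path,
) -> List[List[float]]:
    """Return embeddings for chunks, reusing a cached file when one exists."""
    # Key the cache on the chunk contents so that edited text is re-embedded.
    digest = hashlib.sha256(json.dumps(chunks).encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"embeddings_{digest[:16]}.json"  # hypothetical naming

    if cache_file.exists():
        # Cache hit: skip the expensive embedding step entirely.
        return json.loads(cache_file.read_text(encoding="utf-8"))

    # Cache miss: compute once, then store for the next run.
    embeddings = embed_fn(chunks)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(embeddings), encoding="utf-8")
    return embeddings
```

In a real pipeline embed_fn would wrap the embedding model and cache_dir would presumably point at outputs/; both are left as parameters here so the caching logic stays independent of any particular model.
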
The code requires the same dependencies as the original notebook:
- PyMuPDF (fitz)
- sentence-transformers
- transformers
- torch
- pandas
- numpy
- spacy
- tqdm
- requests
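
Before running the pipeline, it can help to confirm that these packages are importable. The sketch below is one possible check using only the standard library; the helper name missing_packages is hypothetical, and the mapping reflects the usual import names for each package (PyMuPDF is imported as fitz, sentence-transformers as sentence_transformers).

```python
from importlib.util import find_spec

# Maps each package name from the list above to the module name it is imported as.
REQUIRED_MODULES = {
    "PyMuPDF": "fitz",
    "sentence-transformers": "sentence_transformers",
    "transformers": "transformers",
    "torch": "torch",
    "pandas": "pandas",
    "numpy": "numpy",
    "spacy": "spacy",
    "tqdm": "tqdm",
    "requests": "requests",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the package names whose import module cannot be found."""
    # find_spec returns None for a top-level module that is not installed,
    # and it does not import the module, so the check stays cheap.
    return [pkg for pkg, module in required.items() if find_spec(module) is None]
```

Calling missing_packages() with no arguments returns the subset of the list above that still needs to be installed; an empty list means everything was found.
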
Each module has a specific responsibility:
- PDF Processing: Downloads and extracts text from PDFs
- Text Processing: Splits text into sentences and chunks
- Embedding Management: Creates and stores text embeddings
- Retrieval: Finds relevant documents for queries
- LLM Management: Generates answers using language models
- Pipeline: Orchestrates all components together
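
To make the flow between these responsibilities concrete, here is a toy, dependency-free sketch of the text processing, embedding, and retrieval steps. It is only an illustration under loud assumptions: a regex stands in for proper sentence splitting, a bag-of-words count stands in for a real embedding model, and every function name is hypothetical rather than taken from the project's modules. PDF extraction and answer generation are left out.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_chunks(sentences: List[str], chunk_size: int = 2) -> List[str]:
    """Join consecutive sentences into chunks of at most chunk_size sentences."""
    return [
        " ".join(sentences[i : i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: List[str], top_k: int = 1) -> List[Tuple[float, str]]:
    """Score every chunk against the query and return the top_k best matches."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

A generation step would then pass the top-ranked chunks to the language model as context alongside the question.
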