Shishir-Ashok/RAG

Retrieval-Augmented Generation (RAG) System with Semantic Search

A comprehensive implementation of a Retrieval-Augmented Generation (RAG) system that enables natural language querying over PDF documents using state-of-the-art embedding models and semantic similarity search.


Project Overview

This project demonstrates an end-to-end RAG pipeline that processes a PDF textbook (An Introduction to Statistical Learning with Applications in Python), chunks the content into meaningful segments, generates dense vector embeddings, and enables semantic search to retrieve relevant passages based on user queries. The system showcases modern NLP techniques including document parsing, text chunking strategies, embedding generation with Hugging Face models, and efficient similarity search.


Key Features

Document Processing

  • Automated PDF download and text extraction using PyMuPDF (fitz)
  • Intelligent text cleaning and formatting
  • Statistical analysis of document structure (pages, characters, words, sentences, tokens)
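The cleaning and per-page statistics steps above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the helper names (`clean_text`, `page_stats`) are made up for this example, the token count uses a common characters-divided-by-4 heuristic, and the PyMuPDF call that produces the raw text (`page.get_text()` on each page of `fitz.open(pdf_path)`) is omitted so the snippet stays self-contained:

```python
# Illustrative helpers for PDF text cleaning and per-page statistics.
# In the notebook, `raw` would come from PyMuPDF: page.get_text() per page.

def clean_text(raw: str) -> str:
    """Collapse hard line breaks left by PDF extraction into spaces."""
    return raw.replace("\n", " ").strip()

def page_stats(text: str) -> dict:
    """Rough structural statistics; tokens estimated as chars / 4."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "sentence_count_raw": len(text.split(". ")),
        "token_count_est": len(text) / 4,
    }

sample = "Linear regression is a simple approach.\nIt is widely used."
print(page_stats(clean_text(sample)))
```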

Text Chunking Strategy

  • Sentence-level segmentation using spaCy's English sentencizer
  • Configurable chunk sizes (default: 10 sentences per chunk)
  • Token-based filtering to ensure meaningful content chunks (minimum 30 tokens)
  • Preservation of page numbers for source attribution
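The chunking strategy above (10 sentences per chunk, minimum 30 tokens) can be sketched in a few lines. This assumes the sentence list was already produced by spaCy's sentencizer (omitted here to keep the example dependency-free), and uses a chars/4 token estimate; the function names are illustrative:

```python
# Group sentences into fixed-size chunks, then drop chunks that fall
# below a minimum estimated token count (tokens approximated as chars / 4).

def split_into_chunks(sentences, chunk_size=10):
    """Group consecutive sentences into chunks of `chunk_size`."""
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

def filter_chunks(chunks, min_tokens=30):
    """Keep only chunks whose estimated token count meets the minimum."""
    kept = []
    for chunk in chunks:
        joined = " ".join(chunk)
        token_est = len(joined) / 4
        if token_est >= min_tokens:
            kept.append({"sentence_chunk": joined, "chunk_token_count": token_est})
    return kept

sentences = [f"Sentence number {i}." for i in range(25)]
chunks = split_into_chunks(sentences)
print(len(chunks), len(filter_chunks(chunks)))  # 3 chunks; the short tail chunk is dropped
```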

Embedding Generation

  • Model: Qwen3-Embedding-0.6B from Hugging Face
    • Parameters: 595 million
    • Embedding Dimensions: 1,024
    • Max Context Length: 32,768 tokens
  • GPU-accelerated encoding (CUDA support)
  • Batch processing for efficient embedding generation
  • Embeddings persisted to CSV for reusability
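The CSV persistence step can be sketched as a save/reload round trip. This uses stand-in data: in the notebook, `embeddings` comes from `embedding_model.encode(...)` on GPU, and the output is a real pages_and_chunks_embeddings.csv file rather than the in-memory buffer used here; the space-separated string format is an assumption for illustration:

```python
# Persist chunk embeddings alongside their metadata in a CSV, then reload.
from io import StringIO
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
embeddings = rng.random((3, 1024), dtype=np.float32)  # stand-in for model output
chunks = [{"page_number": p, "sentence_chunk": f"chunk {p}"} for p in range(3)]

# Save: one row per chunk, each vector stored as a space-separated string
df = pd.DataFrame(chunks)
df["embedding"] = [" ".join(f"{x:.8f}" for x in vec) for vec in embeddings]
buf = StringIO()                # the notebook writes a real .csv file instead
df.to_csv(buf, index=False)

# Reload: parse each stored string back into a float32 vector
buf.seek(0)
loaded = pd.read_csv(buf)
restored = np.stack(
    list(loaded["embedding"].apply(lambda s: np.array(s.split(), dtype=np.float32)))
)
print(restored.shape)  # (3, 1024)
```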

Semantic Search

  • Dot product similarity scoring using sentence-transformers utilities
  • Top-K retrieval of most relevant text chunks
  • Sub-millisecond query response times (~0.0004 seconds for 1,413 embeddings)
  • Page number tracking for source verification
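The dot-product top-K retrieval described above reduces to a matrix-vector product plus a sort. A minimal NumPy stand-in (the notebook uses `sentence_transformers.util.dot_score` on GPU tensors; here random vectors replace real embeddings, with the query built near chunk 7 so retrieval has a known answer):

```python
# Dot-product similarity search: score every chunk embedding against the
# query embedding, then take the indices of the K highest scores.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1413, 1024))    # one row per text chunk
query_embedding = embeddings[7] + rng.normal(scale=0.01, size=1024)  # near chunk 7

scores = embeddings @ query_embedding          # dot-product similarity, shape (1413,)
top_k = np.argsort(scores)[::-1][:5]           # indices of the 5 best chunks
print(top_k[0])  # -> 7, the chunk the query was built from
```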

Technical Architecture

Pipeline Workflow

  1. Data Acquisition: Downloads PDF from source URL or uses local copy
  2. Text Extraction: PyMuPDF extracts text page-by-page with metadata
  3. Sentence Segmentation: spaCy identifies sentence boundaries
  4. Chunking: Groups sentences into coherent 10-sentence chunks
  5. Embedding: Qwen3-Embedding-0.6B encodes chunks into 1,024-dimensional vectors
  6. Storage: Embeddings and metadata saved to CSV
  7. Query Processing: User queries embedded and compared via dot product similarity
  8. Retrieval: Top-K most similar chunks returned with page references

Technologies Used

  • Document Processing: PyMuPDF (fitz), requests
  • NLP: spaCy (English sentencizer), sentence-transformers
  • Embedding Model: Qwen/Qwen3-Embedding-0.6B (Hugging Face)
  • Machine Learning: PyTorch (CUDA-enabled), NumPy
  • Data Management: Pandas
  • Utilities: tqdm (progress tracking), textwrap (output formatting)

Dataset

Source Document: An Introduction to Statistical Learning with Applications in Python (ISLP)

  • URL: https://hastie.su.domains/ISLP/ISLPwebsite.pdf
  • Pages: 613 (numbered -10 to 602)
  • Content: Comprehensive statistical learning textbook covering regression, classification, resampling, regularization, tree-based methods, SVM, deep learning, survival analysis, and more
  • Preprocessing:
    • 1,460 text chunks created
    • 1,413 chunks retained after filtering (≥30 tokens)
    • Average chunk: ~966 characters, ~163 words, ~241 tokens

Installation and Setup

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (recommended for embedding generation)
  • ~2GB disk space for model downloads

Dependencies

uv pip install torch sentence-transformers spacy pandas numpy pymupdf requests tqdm
python -m spacy download en_core_web_sm

Running the Notebook

  1. Open learn_01.ipynb in Jupyter Notebook or JupyterLab
  2. Execute cells sequentially:
    • Cell 1-2: Download and verify PDF
    • Cell 3-6: Extract and analyze text
    • Cell 7-10: Sentence segmentation with spaCy
    • Cell 11-17: Chunking and filtering
    • Cell 18-22: Embedding generation (GPU-accelerated, ~2.5 min)
    • Cell 23-29: Save embeddings to CSV
    • Cell 30-59: Semantic search demo

Usage Example

Query the System

from sentence_transformers import SentenceTransformer, util
import torch

# Load embedding model (use device="cpu" if no CUDA device is available)
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cuda")

# `embeddings` (a torch tensor of shape [1413, 1024]) and `pages_and_chunks`
# (a list of chunk dicts with "sentence_chunk" and "page_number" keys) are
# produced by the notebook's earlier cells (see Cells 23-29).

# Example query
query = "Linear Models"
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# Compute dot-product similarity against every chunk embedding
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]

# Get top 5 results
top_results = torch.topk(dot_scores, k=5)

# Display results with page numbers for source verification
for score, idx in zip(top_results.values, top_results.indices):
    print(f"Score: {score:.4f}")
    print(f"Text: {pages_and_chunks[idx]['sentence_chunk']}")
    print(f"Page: {pages_and_chunks[idx]['page_number']}\n")

Sample Output

Query: Linear Models
Time: 0.00039 seconds

Score: 0.6750
Text: "134 3. Linear Regression d Is there evidence of non-linear association..."
Page: 132

Score: 0.6474
Text: "6 Linear Model Selection and Regularization In the regression setting..."
Page: 226

Performance Metrics

  • Document Processing: 613 pages processed in ~1 second
  • Embedding Generation: 1,413 chunks encoded in ~2 minutes 37 seconds (GPU)
  • Query Latency: <1 millisecond per search across 1,413 embeddings
  • Storage: Embeddings CSV file size ~2.5MB

Model Details

Qwen3-Embedding-0.6B

  • Developed by Alibaba Qwen team
  • Optimized for semantic similarity tasks
  • Supports multiple languages
  • Efficient inference on consumer GPUs (8GB VRAM sufficient)
  • Available on Hugging Face Model Hub

Limitations and Future Work

Current Limitations

  • No LLM integration for answer generation (retrieval-only system)
  • Static chunking strategy (10 sentences) may split related content
  • Dot product similarity only (no cross-encoder re-ranking)
  • Single document scope

Potential Enhancements

  • Integrate LLM (e.g., Qwen2.5, Llama 3) for generative question answering
  • Implement hybrid search (dense + sparse/BM25)
  • Add cross-encoder re-ranking for improved relevance
  • Support multi-document corpora with metadata filtering
  • Build interactive web interface (Streamlit/Gradio)
  • Experiment with chunking strategies (overlapping windows, semantic segmentation)

Repository Structure

.
├── learn_01.ipynb              # Main Jupyter notebook
├── ISLPwebsite.pdf             # Source document (downloaded)
├── pages_and_chunks_embeddings.csv  # Generated embeddings
└── README.md                   # This file
