A comprehensive implementation of a Retrieval-Augmented Generation (RAG) system that enables natural language querying over PDF documents using state-of-the-art embedding models and semantic similarity search.
This project demonstrates an end-to-end RAG pipeline that processes a PDF textbook (An Introduction to Statistical Learning with Applications in Python), chunks the content into meaningful segments, generates dense vector embeddings, and performs semantic search to retrieve relevant passages for user queries. The system showcases modern NLP techniques including document parsing, text chunking strategies, embedding generation with Hugging Face models, and efficient similarity search.
Document Processing
- Automated PDF download and text extraction using PyMuPDF (fitz)
- Intelligent text cleaning and formatting
- Statistical analysis of document structure (pages, characters, words, sentences, tokens)
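The cleaning and per-page statistics above can be sketched with two small helpers. The function names are illustrative rather than taken from the notebook, and the 4-characters-per-token ratio is a common rough heuristic, not an exact tokenizer count:

```python
def clean_text(raw: str) -> str:
    """Collapse the newlines and extra whitespace PDF extraction leaves behind."""
    return " ".join(raw.split())

def page_stats(text: str) -> dict:
    """Simple per-page statistics, mirroring the counts reported in the notebook."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "sentence_count_raw": len(text.split(". ")),
        "token_count": len(text) / 4,  # rough heuristic: 1 token ≈ 4 characters
    }
```

In the actual pipeline these would run on the text returned by PyMuPDF's `page.get_text()` for each page.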
Text Chunking Strategy
- Sentence-level segmentation using spaCy's English sentencizer
- Configurable chunk sizes (default: 10 sentences per chunk)
- Token-based filtering to ensure meaningful content chunks (minimum 30 tokens)
- Preservation of page numbers for source attribution
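A minimal sketch of the chunking and filtering steps described above (function names are hypothetical; the notebook's implementation may differ in detail):

```python
def chunk_sentences(sentences: list[str], chunk_size: int = 10) -> list[str]:
    """Join consecutive sentences into chunks of up to `chunk_size` sentences each."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]

def keep_chunk(chunk: str, min_tokens: int = 30) -> bool:
    """Token-based filter using the ~4 characters-per-token heuristic."""
    return len(chunk) / 4 >= min_tokens
```

The sentence list itself would come from spaCy's sentencizer (`[s.text for s in nlp(page_text).sents]`), with each chunk keeping its source page number for attribution.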
Embedding Generation
- Model: Qwen3-Embedding-0.6B from Hugging Face
- Parameters: 595 million
- Embedding Dimensions: 1,024
- Max Context Length: 32,768 tokens
- GPU-accelerated encoding (CUDA support)
- Batch processing for efficient embedding generation
- Embeddings persisted to CSV for reusability
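The CSV persistence step can be sketched as follows. Because CSV stores the embedding column as a stringified list, reloading needs a parse step (`ast.literal_eval` here); function names are illustrative:

```python
import ast

import numpy as np
import pandas as pd

def save_embeddings(pages_and_chunks: list[dict], embeddings: np.ndarray, path: str) -> None:
    """Write chunk metadata plus an embedding column to CSV for reuse."""
    df = pd.DataFrame(pages_and_chunks)
    df["embedding"] = [emb.tolist() for emb in embeddings]
    df.to_csv(path, index=False)

def load_embeddings(path: str) -> tuple[pd.DataFrame, np.ndarray]:
    """Reload the CSV and rebuild the (n_chunks, embedding_dim) matrix."""
    df = pd.read_csv(path)
    df["embedding"] = df["embedding"].apply(ast.literal_eval)  # lists round-trip as strings
    return df, np.stack(df["embedding"].to_list())
```

This avoids re-encoding the 1,413 chunks (~2.5 minutes on GPU) on every run.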
Semantic Search
- Dot product similarity scoring using sentence-transformers utilities
- Top-K retrieval of most relevant text chunks
- Sub-millisecond query response times (~0.0004 seconds for 1,413 embeddings)
- Page number tracking for source verification
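The scoring step reduces to a single matrix-vector product. A NumPy-only equivalent of the dot-score/top-K logic (independent of the sentence-transformers utilities the notebook uses):

```python
import numpy as np

def top_k_dot(query_embedding: np.ndarray, embeddings: np.ndarray, k: int = 5):
    """Score every chunk with a dot product and return (scores, indices) for the top k."""
    scores = embeddings @ query_embedding   # shape: (n_chunks,)
    top_idx = np.argsort(scores)[::-1][:k]  # highest scores first
    return scores[top_idx], top_idx
```

For 1,413 vectors of dimension 1,024 this is a tiny computation, which is why query latency stays well under a millisecond.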
- Data Acquisition: Downloads PDF from source URL or uses local copy
- Text Extraction: PyMuPDF extracts text page-by-page with metadata
- Sentence Segmentation: spaCy identifies sentence boundaries
- Chunking: Groups sentences into coherent 10-sentence chunks
- Embedding: Qwen3-Embedding-0.6B encodes chunks into 1,024-dimensional vectors
- Storage: Embeddings and metadata saved to CSV
- Query Processing: User queries embedded and compared via dot product similarity
- Retrieval: Top-K most similar chunks returned with page references
- Document Processing: PyMuPDF (fitz), requests
- NLP: spaCy (English sentencizer), sentence-transformers
- Embedding Model: Qwen/Qwen3-Embedding-0.6B (Hugging Face)
- Machine Learning: PyTorch (CUDA-enabled), NumPy
- Data Management: Pandas
- Utilities: tqdm (progress tracking), textwrap (output formatting)
Source Document: An Introduction to Statistical Learning with Applications in Python (ISLP)
- URL: https://hastie.su.domains/ISLP/ISLPwebsite.pdf
- Pages: 613 (numbered -10 to 602)
- Content: Comprehensive statistical learning textbook covering regression, classification, resampling, regularization, tree-based methods, SVM, deep learning, survival analysis, and more
- Preprocessing:
- 1,460 text chunks created
- 1,413 chunks retained after filtering (≥30 tokens)
- Average chunk: ~966 characters, ~163 words, ~241 tokens
- Python 3.8+
- CUDA-capable GPU (recommended for embedding generation)
- ~2GB disk space for model downloads
uv pip install torch sentence-transformers spacy pandas numpy pymupdf requests tqdm
python -m spacy download en_core_web_sm
- Open learn_01.ipynb in Jupyter Notebook or JupyterLab
- Execute the cells sequentially:
- Cell 1-2: Download and verify PDF
- Cell 3-6: Extract and analyze text
- Cell 7-10: Sentence segmentation with spaCy
- Cell 11-17: Chunking and filtering
- Cell 18-22: Embedding generation (GPU-accelerated, ~2.5 min)
- Cell 23-29: Save embeddings to CSV
- Cell 30-59: Semantic search demo
from sentence_transformers import SentenceTransformer, util
import torch
# Load embedding model
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cuda")
# Example query
query = "Linear Models"
query_embedding = embedding_model.encode(query, convert_to_tensor=True)
# Compute similarity against the precomputed chunk-embedding tensor
# (embeddings: shape (1413, 1024), loaded from the saved CSV)
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
# Get top 5 results
top_results = torch.topk(dot_scores, k=5)
# Display results with page numbers
for score, idx in zip(top_results[0], top_results[1]):
    print(f"Score: {score:.4f}")
    print(f"Text: {pages_and_chunks[idx]['sentence_chunk']}")
    print(f"Page: {pages_and_chunks[idx]['page_number']}\n")

Query: Linear Models
Time: 0.00039 seconds
Score: 0.6750
Text: "134 3. Linear Regression d Is there evidence of non-linear association..."
Page: 132
Score: 0.6474
Text: "6 Linear Model Selection and Regularization In the regression setting..."
Page: 226
- Document Processing: 613 pages processed in ~1 second
- Embedding Generation: 1,413 chunks encoded in ~2 minutes 37 seconds (GPU)
- Query Latency: <1 millisecond per search across 1,413 embeddings
- Storage: Embeddings CSV file size ~2.5MB
Qwen3-Embedding-0.6B
- Developed by Alibaba Qwen team
- Optimized for semantic similarity tasks
- Supports multiple languages
- Efficient inference on consumer GPUs (8GB VRAM sufficient)
- Available on Hugging Face Model Hub
Current Limitations
- No LLM integration for answer generation (retrieval-only system)
- Static chunking strategy (10 sentences) may split related content
- Dot product similarity only (no cross-encoder re-ranking)
- Single document scope
Potential Enhancements
- Integrate LLM (e.g., Qwen2.5, Llama 3) for generative question answering
- Implement hybrid search (dense + sparse/BM25)
- Add cross-encoder re-ranking for improved relevance
- Support multi-document corpora with metadata filtering
- Build interactive web interface (Streamlit/Gradio)
- Experiment with chunking strategies (overlapping windows, semantic segmentation)
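As a sketch of the cross-encoder re-ranking enhancement: dense retrieval narrows the corpus to a candidate list, then a pluggable scorer re-orders it. The `score_fn` interface here matches sentence-transformers' `CrossEncoder.predict`, which scores (query, passage) pairs jointly, but any callable with that shape works:

```python
def rerank(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    """Re-order dense-retrieval candidates by a joint (query, passage) relevance score.

    score_fn takes a list of (query, passage) pairs and returns one score per pair,
    e.g. sentence-transformers' CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict.
    """
    pairs = [(query, passage) for passage in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [passage for _, passage in ranked[:top_k]]
```

Because the cross-encoder reads query and passage together, it is far slower than dot-product scoring, which is why it is applied only to the top candidates rather than all 1,413 chunks.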
.
├── learn_01.ipynb # Main Jupyter notebook
├── ISLPwebsite.pdf # Source document (downloaded)
├── pages_and_chunks_embeddings.csv # Generated embeddings
└── README.md # This file
- Textbook: James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer.
- Embedding Model: Qwen/Qwen3-Embedding-0.6B
- Sentence Transformers: sbert.net
- PyMuPDF: pymupdf.readthedocs.io