A comprehensive implementation of a Retrieval-Augmented Generation (RAG) system that enables natural language querying over PDF documents using state-of-the-art embedding models and semantic similarity search.
This project demonstrates an end-to-end RAG pipeline that processes a PDF textbook (An Introduction to Statistical Learning with Applications in Python), chunks the content into meaningful segments, generates dense vector embeddings, and performs semantic search to retrieve relevant passages for user queries. The system showcases modern NLP techniques including document parsing, text chunking strategies, embedding generation with Hugging Face models, and efficient similarity search.
Document Processing
- Automated PDF download and text extraction using PyMuPDF (fitz)
- Intelligent text cleaning and formatting
- Statistical analysis of document structure (pages, characters, words, sentences, tokens)
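The cleaning and per-page statistics above can be sketched with two small helpers. The function names are illustrative rather than taken from the notebook, and the 4-characters-per-token ratio is a common rough heuristic, not an exact tokenizer count:

```python
def clean_text(raw: str) -> str:
    """Collapse the newlines and extra whitespace PDF extraction leaves behind."""
    return " ".join(raw.split())

def page_stats(text: str) -> dict:
    """Simple per-page statistics, mirroring the counts reported in the notebook."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "sentence_count_raw": len(text.split(". ")),
        "token_count": len(text) / 4,  # rough heuristic: 1 token ≈ 4 characters
    }
```

In the actual pipeline these would run on the text returned by PyMuPDF's `page.get_text()` for each page.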
Text Chunking Strategy
- Sentence-level segmentation using spaCy's English sentencizer
- Configurable chunk sizes (default: 10 sentences per chunk)
- Token-based filtering to ensure meaningful content chunks (minimum 30 tokens)
- Preservation of page numbers for source attribution
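A minimal sketch of the chunking and filtering steps described above (function names are hypothetical; the notebook's implementation may differ in detail):

```python
def chunk_sentences(sentences: list[str], chunk_size: int = 10) -> list[str]:
    """Join consecutive sentences into chunks of up to `chunk_size` sentences each."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]

def keep_chunk(chunk: str, min_tokens: int = 30) -> bool:
    """Token-based filter using the ~4 characters-per-token heuristic."""
    return len(chunk) / 4 >= min_tokens
```

The sentence list itself would come from spaCy's sentencizer (`[s.text for s in nlp(page_text).sents]`), with each chunk keeping its source page number for attribution.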
Embedding Generation
- Model: Qwen3-Embedding-0.6B from Hugging Face
- Parameters: 595 million
- Embedding Dimensions: 1,024
- Max Context Length: 32,768 tokens
- GPU-accelerated encoding (CUDA support)
- Batch processing for efficient embedding generation
- Embeddings persisted to CSV for reusability
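The CSV persistence step can be sketched as follows. Because CSV stores the embedding column as a stringified list, reloading needs a parse step (`ast.literal_eval` here); function names are illustrative:

```python
import ast

import numpy as np
import pandas as pd

def save_embeddings(pages_and_chunks: list[dict], embeddings: np.ndarray, path: str) -> None:
    """Write chunk metadata plus an embedding column to CSV for reuse."""
    df = pd.DataFrame(pages_and_chunks)
    df["embedding"] = [emb.tolist() for emb in embeddings]
    df.to_csv(path, index=False)

def load_embeddings(path: str) -> tuple[pd.DataFrame, np.ndarray]:
    """Reload the CSV and rebuild the (n_chunks, embedding_dim) matrix."""
    df = pd.read_csv(path)
    df["embedding"] = df["embedding"].apply(ast.literal_eval)  # lists round-trip as strings
    return df, np.stack(df["embedding"].to_list())
```

This avoids re-encoding the 1,413 chunks (~2.5 minutes on GPU) on every run.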
Semantic Search
- Dot product similarity scoring using sentence-transformers utilities
- Top-K retrieval of most relevant text chunks
- Sub-millisecond query response times (~0.0004 seconds for 1,413 embeddings)
- Page number tracking for source verification
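The scoring step reduces to a single matrix-vector product. A NumPy-only equivalent of the dot-score/top-K logic (independent of the sentence-transformers utilities the notebook uses):

```python
import numpy as np

def top_k_dot(query_embedding: np.ndarray, embeddings: np.ndarray, k: int = 5):
    """Score every chunk with a dot product and return (scores, indices) for the top k."""
    scores = embeddings @ query_embedding   # shape: (n_chunks,)
    top_idx = np.argsort(scores)[::-1][:k]  # highest scores first
    return scores[top_idx], top_idx
```

For 1,413 vectors of dimension 1,024 this is a tiny computation, which is why query latency stays well under a millisecond.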
- Data Acquisition: Downloads PDF from source URL or uses local copy
- Text Extraction: PyMuPDF extracts text page-by-page with metadata
- Sentence Segmentation: spaCy identifies sentence boundaries
- Chunking: Groups sentences into coherent 10-sentence chunks
- Embedding: Qwen3-Embedding-0.6B encodes chunks into 1,024-dimensional vectors
- Storage: Embeddings and metadata saved to CSV
- Query Processing: User queries embedded and compared via dot product similarity
- Retrieval: Top-K most similar chunks returned with page references
- Document Processing: PyMuPDF (fitz), requests
- NLP: spaCy (English sentencizer), sentence-transformers
- Embedding Model: Qwen/Qwen3-Embedding-0.6B (Hugging Face)
- Machine Learning: PyTorch (CUDA-enabled), NumPy
- Data Management: Pandas
- Utilities: tqdm (progress tracking), textwrap (output formatting)
Source Document: An Introduction to Statistical Learning with Applications in Python (ISLP)
- URL: https://hastie.su.domains/ISLP/ISLPwebsite.pdf
- Pages: 613 (numbered -10 to 602)
- Content: Comprehensive statistical learning textbook covering regression, classification, resampling, regularization, tree-based methods, SVM, deep learning, survival analysis, and more
- Preprocessing:
- 1,460 text chunks created
- 1,413 chunks retained after filtering (≥30 tokens)
- Average chunk: ~966 characters, ~163 words, ~241 tokens
- Python 3.8+
- CUDA-capable GPU (recommended for embedding generation)
- ~2GB disk space for model downloads
uv pip install torch sentence-transformers spacy pandas numpy pymupdf requests tqdm
python -m spacy download en_core_web_sm
- Open learn_01.ipynb in Jupyter Notebook or JupyterLab
- Execute the cells sequentially:
- Cell 1-2: Download and verify PDF
- Cell 3-6: Extract and analyze text
- Cell 7-10: Sentence segmentation with spaCy
- Cell 11-17: Chunking and filtering
- Cell 18-22: Embedding generation (GPU-accelerated, ~2.5 min)
- Cell 23-29: Save embeddings to CSV
- Cell 30-59: Semantic search demo
from sentence_transformers import SentenceTransformer, util
import torch
# Load embedding model
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cuda")
# Example query
query = "Linear Models"
query_embedding = embedding_model.encode(query, convert_to_tensor=True)
# Compute similarity against the precomputed chunk-embedding tensor
# (embeddings: shape (1413, 1024), loaded from the saved CSV)
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
# Get top 5 results
top_results = torch.topk(dot_scores, k=5)
# Display results with page numbers
for score, idx in zip(top_results[0], top_results[1]):
    print(f"Score: {score:.4f}")
    print(f"Text: {pages_and_chunks[idx]['sentence_chunk']}")
    print(f"Page: {pages_and_chunks[idx]['page_number']}\n")

Query: Linear Models
Time: 0.00039 seconds
Score: 0.6750
Text: "134 3. Linear Regression d Is there evidence of non-linear association..."
Page: 132
Score: 0.6474
Text: "6 Linear Model Selection and Regularization In the regression setting..."
Page: 226
- Document Processing: 613 pages processed in ~1 second
- Embedding Generation: 1,413 chunks encoded in ~2 minutes 37 seconds (GPU)
- Query Latency: <1 millisecond per search across 1,413 embeddings
- Storage: Embeddings CSV file size ~2.5MB
Qwen3-Embedding-0.6B
- Developed by Alibaba Qwen team
- Optimized for semantic similarity tasks
- Supports multiple languages
- Efficient inference on consumer GPUs (8GB VRAM sufficient)
- Available on Hugging Face Model Hub
Current Limitations
- No LLM integration for answer generation (retrieval-only system)
- Static chunking strategy (10 sentences) may split related content
- Dot product similarity only (no cross-encoder re-ranking)
- Single document scope
Potential Enhancements
- Integrate LLM (e.g., Qwen2.5, Llama 3) for generative question answering
- Implement hybrid search (dense + sparse/BM25)
- Add cross-encoder re-ranking for improved relevance
- Support multi-document corpora with metadata filtering
- Build interactive web interface (Streamlit/Gradio)
- Experiment with chunking strategies (overlapping windows, semantic segmentation)
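As a sketch of the cross-encoder re-ranking enhancement: dense retrieval narrows the corpus to a candidate list, then a pluggable scorer re-orders it. The `score_fn` interface here matches sentence-transformers' `CrossEncoder.predict`, which scores (query, passage) pairs jointly, but any callable with that shape works:

```python
def rerank(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    """Re-order dense-retrieval candidates by a joint (query, passage) relevance score.

    score_fn takes a list of (query, passage) pairs and returns one score per pair,
    e.g. sentence-transformers' CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict.
    """
    pairs = [(query, passage) for passage in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [passage for _, passage in ranked[:top_k]]
```

Because the cross-encoder reads query and passage together, it is far slower than dot-product scoring, which is why it is applied only to the top candidates rather than all 1,413 chunks.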
.
├── learn_01.ipynb # Main Jupyter notebook
├── ISLPwebsite.pdf # Source document (downloaded)
├── pages_and_chunks_embeddings.csv # Generated embeddings
└── README.md # This file
- Textbook: James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer.
- Embedding Model: Qwen/Qwen3-Embedding-0.6B
- Sentence Transformers: sbert.net
- PyMuPDF: pymupdf.readthedocs.io