Hey there! Welcome to my hands-on exploration of a crucial, yet often underestimated, part of building RAG systems: Text Splitting. I realized early on that you can't just dump a whole book into an LLM's context window. To build effective retrieval systems, you need to break down large documents into small, meaningful chunks.
This repository is my personal journey and a collection of scripts where I experiment with different text splitting strategies available in LangChain. I've covered everything from the most basic character-level splitting to advanced semantic chunking.
Before building any RAG application, you have to prepare your data. LLMs have a limited context window (the amount of text they can "see" at once). If your document is too long, you have to split it. But how you split it matters.
- Chunks too small? You lose important context.
- Chunks too large? They won't fit in the prompt.
- Bad splits? You might cut a sentence in half, destroying its meaning.
Finding the right splitting strategy is the key to high-quality retrieval.
Here’s a breakdown of the different text splitters I've experimented with in this repository (a minimal usage sketch of each one follows the list):
- **Character-Based Splitting** (`length_textsplitter.py`): This is the simplest method. The `CharacterTextSplitter` chunks the text based on a fixed number of characters (e.g., every 100 characters). It's a blunt instrument, but a great starting point for understanding the basics of chunk size and overlap.
- **Structure-Aware Splitting** (`text_structure_based.py`): This is a much smarter approach. The `RecursiveCharacterTextSplitter` tries to split text along natural boundaries. It works through a prioritized list of separators (like `"\n\n"`, `"\n"`, and `". "`), falling back to finer-grained ones until each chunk fits the size limit, so paragraphs and sentences stay intact whenever possible.
- **Code-Aware Splitting** (`python_code_splitting.py`): Splitting code is a unique challenge because you need to preserve its structure (classes, functions, etc.). Here, I used a `RecursiveCharacterTextSplitter` specifically configured with Python-aware separators to intelligently chunk a Python script without breaking its syntax.
- **Semantic Splitting** (`semantic_meaning_based.py`): This is the most advanced technique I explored. Instead of relying on characters or separators, the `SemanticChunker` uses an embedding model to find "semantic breakpoints" in the text. It groups sentences that are contextually related, resulting in chunks that are topically coherent. This is the state of the art for preserving meaning.
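Here's roughly what the character-based approach looks like — a minimal sketch, not necessarily the exact contents of `length_textsplitter.py`, assuming the `langchain-text-splitters` package and the repo's `cricket.txt` sample:

```python
from langchain_text_splitters import CharacterTextSplitter

text = open("cricket.txt", encoding="utf-8").read()

# separator="" disables separator matching, so the text is cut every
# ~100 characters with a 20-character overlap between neighbouring chunks.
splitter = CharacterTextSplitter(separator="", chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; first chunk:\n{chunks[0]}")
```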
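A sketch of the structure-aware version under the same assumptions; the `separators` list shown here is the splitter's usual priority order, and the chunk size of 200 is just illustrative:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("cricket.txt", encoding="utf-8").read()

# Separators are tried in order: paragraphs first, then lines, then
# sentences, then words — so splits land on natural boundaries when possible.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)
```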
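For code-aware splitting, a sketch along these lines; the short `python_source` string is an illustrative stand-in for a real script:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_source = '''
def hello(name):
    """Greet someone by name."""
    return f"Hello, {name}!"

class Greeter:
    def greet(self):
        return hello("world")
'''

# from_language() preloads Python-aware separators (class and def
# boundaries before blank lines), so chunks align with syntactic units.
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=120, chunk_overlap=0
)
docs = splitter.create_documents([python_source])
for d in docs:
    print(d.page_content, "\n---")
```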
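And a sketch of semantic chunking with FAISS indexing, assuming `langchain-experimental`, `langchain-openai`, `langchain-community`, and `faiss-cpu` are installed and an OpenAI API key is configured (the example query string is made up):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

text = open("cricket.txt", encoding="utf-8").read()

embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY in the environment

# Embed each sentence and start a new chunk wherever the embedding distance
# between neighbours spikes past the chosen percentile threshold.
chunker = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
docs = chunker.create_documents([text])

# Index the topically coherent chunks in FAISS for retrieval.
vectorstore = FAISS.from_documents(docs, embeddings)
print(vectorstore.similarity_search("rules of cricket", k=2))
```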
- Core Framework: LangChain
- LLM & Embedding Provider: OpenAI
- Vector Store: FAISS (for semantic splitting)
- Document Loading: PyPDF
- Core Libraries: `langchain-core`, `langchain-community`, `langchain-openai`, `python-dotenv`
- Clone the repository:

  ```bash
  git clone https://github.com/jsonusuman351/Langchain_Text_Splitter.git
  cd Langchain_Text_Splitter
  ```

- Create and activate a virtual environment:

  ```bash
  # Python 3.10 or higher is recommended
  python -m venv venv
  .\venv\Scripts\activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:
  - Create a file named `.env` in the root directory.
  - Add your OpenAI API key to this file:

    ```
    OPENAI_API_KEY="your-openai-api-key"
    ```
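The scripts can then pick the key up at runtime via `python-dotenv` — a minimal sketch (the exact loading code in each script may differ):

```python
from dotenv import load_dotenv

# Reads OPENAI_API_KEY from .env into the process environment,
# where langchain-openai picks it up automatically.
load_dotenv()
```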
Each script is a self-contained example. Feel free to run them to see how each text splitter chunks the sample documents.
- Simple Character Splitting: `python length_textsplitter.py`
- Recursive (Structure-Aware) Splitting: `python text_structure_based.py`
- Python Code Splitting: `python python_code_splitting.py`
- Advanced Semantic Splitting: `python semantic_meaning_based.py`
I've organized the scripts based on the core splitting technique they demonstrate, making it easy to compare the results.
<details>
<summary>Click to view the code layout</summary>

```
Langchain_Text_Splitter/
│
├── length_textsplitter.py      # Basic: Splits by character count
├── text_structure_based.py     # Better: Splits by text structure (paragraphs, sentences)
├── python_code_splitting.py    # Specialized: Splits Python code intelligently
├── semantic_meaning_based.py   # Advanced: Splits by semantic similarity
│
├── pdf_collection/             # Sample PDF files for testing
├── cricket.txt                 # Sample text file
│
├── requirements.txt
├── .env                        # (You need to create this for your API key)
└── README.md
```

</details>