Hey there! Welcome to my hands-on exploration of a crucial, yet often underestimated, part of building RAG systems: Text Splitting. I realized early on that you can't just dump a whole book into an LLM's context window. To build effective retrieval systems, you need to break down large documents into small, meaningful chunks.
This repository is my personal journey and a collection of scripts where I experiment with different text splitting strategies available in LangChain. I've covered everything from the most basic character-level splitting to advanced semantic chunking.
Before building any RAG application, you have to prepare your data. LLMs have a limited context window (the amount of text they can "see" at once). If your document is too long, you have to split it. But how you split it matters.
- Chunks too small? You lose important context.
- Chunks too large? They won't fit in the prompt.
- Bad splits? You might cut a sentence in half, destroying its meaning.
Finding the right splitting strategy is the key to high-quality retrieval.
Here’s a breakdown of the different text splitters I've experimented with in this repository (a minimal usage sketch of each one follows the list):
- **Character-Based Splitting** (`length_textsplitter.py`): This is the simplest method. The `CharacterTextSplitter` chunks the text based on a fixed number of characters (e.g., every 100 characters). It's a blunt instrument, but a great starting point for understanding the basics of chunk size and overlap.
- **Structure-Aware Splitting** (`text_structure_based.py`): This is a much smarter approach. The `RecursiveCharacterTextSplitter` tries to split text along natural boundaries. It works through a prioritized list of separators (like `"\n\n"`, `"\n"`, and `". "`), falling back to finer-grained ones until each chunk fits the size limit, so paragraphs and sentences stay intact whenever possible.
- **Code-Aware Splitting** (`python_code_splitting.py`): Splitting code is a unique challenge because you need to preserve its structure (classes, functions, etc.). Here, I used a `RecursiveCharacterTextSplitter` specifically configured with Python-aware separators to intelligently chunk a Python script without breaking its syntax.
- **Semantic Splitting** (`semantic_meaning_based.py`): This is the most advanced technique I explored. Instead of relying on characters or separators, the `SemanticChunker` uses an embedding model to find "semantic breakpoints" in the text. It groups sentences that are contextually related, resulting in chunks that are topically coherent. This is the state of the art for preserving meaning.
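Here's roughly what the character-based approach looks like — a minimal sketch, not necessarily the exact contents of `length_textsplitter.py`, assuming the `langchain-text-splitters` package and the repo's `cricket.txt` sample:

```python
from langchain_text_splitters import CharacterTextSplitter

text = open("cricket.txt", encoding="utf-8").read()

# separator="" disables separator matching, so the text is cut every
# ~100 characters with a 20-character overlap between neighbouring chunks.
splitter = CharacterTextSplitter(separator="", chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; first chunk:\n{chunks[0]}")
```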
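A sketch of the structure-aware version under the same assumptions; the `separators` list shown here is the splitter's usual priority order, and the chunk size of 200 is just illustrative:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("cricket.txt", encoding="utf-8").read()

# Separators are tried in order: paragraphs first, then lines, then
# sentences, then words — so splits land on natural boundaries when possible.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)
```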
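For code-aware splitting, a sketch along these lines; the short `python_source` string is an illustrative stand-in for a real script:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_source = '''
def hello(name):
    """Greet someone by name."""
    return f"Hello, {name}!"

class Greeter:
    def greet(self):
        return hello("world")
'''

# from_language() preloads Python-aware separators (class and def
# boundaries before blank lines), so chunks align with syntactic units.
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=120, chunk_overlap=0
)
docs = splitter.create_documents([python_source])
for d in docs:
    print(d.page_content, "\n---")
```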
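And a sketch of semantic chunking with FAISS indexing, assuming `langchain-experimental`, `langchain-openai`, `langchain-community`, and `faiss-cpu` are installed and an OpenAI API key is configured (the example query string is made up):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

text = open("cricket.txt", encoding="utf-8").read()

embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY in the environment

# Embed each sentence and start a new chunk wherever the embedding distance
# between neighbours spikes past the chosen percentile threshold.
chunker = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
docs = chunker.create_documents([text])

# Index the topically coherent chunks in FAISS for retrieval.
vectorstore = FAISS.from_documents(docs, embeddings)
print(vectorstore.similarity_search("rules of cricket", k=2))
```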
- Core Framework: LangChain
- LLM & Embedding Provider: OpenAI
- Vector Store: FAISS (for semantic splitting)
- Document Loading: PyPDF
- Core Libraries: `langchain-core`, `langchain-community`, `langchain-openai`, `python-dotenv`
- Clone the repository:

  ```bash
  git clone https://github.com/jsonusuman351/Langchain_Text_Splitter.git
  cd Langchain_Text_Splitter
  ```

- Create and activate a virtual environment:

  ```bash
  # Python 3.10 or higher is recommended
  python -m venv venv
  .\venv\Scripts\activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:
  - Create a file named `.env` in the root directory.
  - Add your OpenAI API key to this file:

    ```
    OPENAI_API_KEY="your-openai-api-key"
    ```
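The scripts can then pick the key up at runtime via `python-dotenv` — a minimal sketch (the exact loading code in each script may differ):

```python
from dotenv import load_dotenv

# Reads OPENAI_API_KEY from .env into the process environment,
# where langchain-openai picks it up automatically.
load_dotenv()
```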
Each script is a self-contained example. Feel free to run them to see how each text splitter chunks the sample documents.
- Simple Character Splitting: `python length_textsplitter.py`
- Recursive (Structure-Aware) Splitting: `python text_structure_based.py`
- Python Code Splitting: `python python_code_splitting.py`
- Advanced Semantic Splitting: `python semantic_meaning_based.py`
I've organized the scripts based on the core splitting technique they demonstrate, making it easy to compare the results.
<details>
<summary>Click to view the code layout</summary>

```
Langchain_Text_Splitter/
│
├── length_textsplitter.py      # Basic: Splits by character count
├── text_structure_based.py     # Better: Splits by text structure (paragraphs, sentences)
├── python_code_splitting.py    # Specialized: Splits Python code intelligently
├── semantic_meaning_based.py   # Advanced: Splits by semantic similarity
│
├── pdf_collection/             # Sample PDF files for testing
├── cricket.txt                 # Sample text file
│
├── requirements.txt
├── .env                        # (You need to create this for your API key)
└── README.md
```

</details>