
An exploration of advanced text splitting strategies in LangChain for RAG, from basic character splitting to state-of-the-art semantic chunking.


jsonusuman351/Langchain_Text_Splitter


🔪 A Deep Dive into Text Splitting with LangChain

LangChain OpenAI FAISS PyPDF Python-Dotenv

Hey there! Welcome to my hands-on exploration of a crucial, yet often underestimated, part of building RAG systems: Text Splitting. I realized early on that you can't just dump a whole book into an LLM's context window. To build effective retrieval systems, you need to break down large documents into small, meaningful chunks.

This repository is my personal journey and a collection of scripts where I experiment with different text splitting strategies available in LangChain. I've covered everything from the most basic character-level splitting to advanced semantic chunking.


🤔 The "Why" Behind Text Splitting

Before building any RAG application, you have to prepare your data. LLMs have a limited context window (the amount of text they can "see" at once). If your document is too long, you have to split it. But how you split it matters.

  • Chunks too small? You lose important context.
  • Chunks too large? They may not fit in the context window, and retrieval becomes less precise.
  • Bad split points? You might cut a sentence in half, destroying its meaning.

Finding the right splitting strategy is the key to high-quality retrieval.


✨ Core Splitting Strategies I've Explored

Here’s a breakdown of the different text splitters I've experimented with in this repository:

  1. Character-Based Splitting (length_textsplitter.py):

    • This is the simplest method. The CharacterTextSplitter chunks the text based on a fixed number of characters (e.g., every 100 characters). It's a blunt instrument but a great starting point to understand the basics of chunk size and overlap.
  2. Structure-Aware Splitting (text_structure_based.py):

    • This is a much smarter approach. The RecursiveCharacterTextSplitter tries to split text along natural boundaries. It works through a prioritized list of separators (paragraph breaks \n\n, then line breaks \n, then spaces), falling back to a finer-grained separator only when a chunk is still too large. This helps keep related paragraphs and sentences together.
  3. Code-Aware Splitting (python_code_splitting.py):

    • Splitting code is a unique challenge because you need to preserve its structure (classes, functions, etc.). Here, I used a RecursiveCharacterTextSplitter specifically configured with Python-aware separators to intelligently chunk a Python script without breaking its syntax.
  4. Semantic Splitting (semantic_meaning_based.py):

    • This is the most advanced technique I explored. Instead of relying on characters or separators, the SemanticChunker uses an embedding model to find "semantic breakpoints" in the text. It groups sentences that are contextually related, resulting in chunks that are topically coherent. This is the state-of-the-art for preserving meaning.
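The separator-priority idea behind RecursiveCharacterTextSplitter (strategy 2 above) is easy to see in plain Python. The sketch below is a toy re-implementation of the concept, not LangChain's actual code, and it skips details like chunk overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Toy illustration of separator-priority splitting: try the
    coarsest separator first, and recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                if len(piece) <= chunk_size:
                    chunks.append(piece)
                else:
                    # Fall back to the next, finer-grained separator.
                    chunks.extend(recursive_split(piece, chunk_size, separators[i + 1:]))
            return [c for c in chunks if c]
    # No separator left: hard cut by character count (what
    # plain CharacterTextSplitter effectively does).
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "First paragraph about cricket.\n\nSecond paragraph. It has two sentences."
for chunk in recursive_split(text, chunk_size=40):
    print(chunk)
```

Because paragraph breaks are tried first, the two paragraphs above come out as separate chunks instead of being cut at character 40 mid-sentence.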

🛠️ Tech Stack

  • Core Framework: LangChain
  • LLM & Embedding Provider: OpenAI
  • Vector Store: FAISS (for semantic splitting)
  • Document Loading: PyPDF
  • Core Libraries: langchain-core, langchain-community, langchain-openai, python-dotenv

⚙️ Setup and Installation

  1. Clone the repository:

    git clone https://github.com/jsonusuman351/Langchain_Text_Splitter.git
    cd Langchain_Text_Splitter
  2. Create and activate a virtual environment:

    # It is recommended to use Python 3.10 or higher
    python -m venv venv
    .\venv\Scripts\activate       # Windows
    # source venv/bin/activate    # macOS/Linux
  3. Install the required packages:

    pip install -r requirements.txt
  4. Set Up Environment Variables:

    • Create a file named .env in the root directory.
    • Add your OpenAI API key to this file:
      OPENAI_API_KEY="your-openai-api-key"
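In the scripts, python-dotenv picks this file up via `from dotenv import load_dotenv; load_dotenv()`. If you're curious what that does under the hood, here is a simplified stand-in (it ignores quoting and interpolation edge cases that the real library handles):

```python
import os

def load_env_file(path=".env"):
    """Simplified sketch of python-dotenv's load_dotenv: parse
    KEY=VALUE lines and export them to the process environment."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

This is only an illustration; in the repo itself, just use the real `load_dotenv()`.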

🚀 How to Run the Scripts

Each script is a self-contained example. Feel free to run them to see how each text splitter chunks the sample documents.

  • Simple Character Splitting:
    python length_textsplitter.py
  • Recursive (Structure-Aware) Splitting:
    python text_structure_based.py
  • Python Code Splitting:
    python python_code_splitting.py
  • Advanced Semantic Splitting:
    python semantic_meaning_based.py
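The semantic script is the most interesting one to run. The core idea behind SemanticChunker, grouping consecutive sentences until their similarity drops, can be shown with toy vectors. This sketch uses fake word-count "embeddings" in place of OpenAI's model, so it only illustrates the breakpoint logic, not real semantic quality:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk whenever similarity
    between neighbours drops below the threshold (a 'semantic breakpoint')."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])       # breakpoint: the topic changed
        else:
            chunks[-1].append(cur)     # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]

# Toy 'embedding': counts of two topic words instead of a real model.
def toy_embed(sentence):
    return [sentence.lower().count(t) + 0.01 for t in ("cricket", "python")]

sents = ["Cricket is a bat-and-ball game.",
         "A cricket match has two innings.",
         "Python is a programming language."]
print(semantic_chunks(sents, toy_embed))
# The two cricket sentences merge into one chunk; the Python sentence
# starts a new one, because its toy embedding points at a different topic.
```

In the real script, the embedding function is an OpenAI embedding model, and the threshold is derived from the distribution of distances rather than hard-coded.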

🔬 A Tour of My Splitting Experiments

I've organized the scripts based on the core splitting technique they demonstrate, making it easy to compare the results.

Click to view the code layout
Langchain_Text_Splitter/
│
├── length_textsplitter.py      # Basic: Splits by character count
├── text_structure_based.py   # Better: Splits by text structure (paragraphs, sentences)
├── python_code_splitting.py    # Specialized: Splits Python code intelligently
├── semantic_meaning_based.py # Advanced: Splits by semantic similarity
│
├── pdf_collection/             # Sample PDF files for testing
├── cricket.txt                 # Sample text file
│
├── requirements.txt
├── .env                        # (You need to create this for your API key)
└── README.md

