nimonkaranurag/langchain-study
Index

Quick Run

For an enhanced learning experience, you can set up Claude Desktop and ask it questions about the README or other implementations in this repo!

Click here to see how to set up an MCP server that you can integrate with any MCP client!

Frontend Experience (Recommended!)

There's a Streamlit frontend I designed for ease of navigation (or you can skip it and just use the CLI instead). To launch it, run the following commands in a Python (>=3.12) environment:

pip install -e .
pip install tqdm
streamlit run frontend/_🙋🏽‍♂️_welcome.py

Expected Output

(screenshots: welcome, hr assistant, study assistant, search assistant)

Environment Setup

Create a .env file at the project root with the following properties:

  • Sign in to Groq (the default inference endpoint/model provider) and add the key:

    GROQ_API_KEY=<your_groq_api_key>
    
  • For tracing with LangSmith, add the following keys:

    LANGSMITH_TRACING=true
    LANGSMITH_API_KEY=<your_langsmith_api_key>
    LANGSMITH_PROJECT=langchain-study
    
  • To use the react/search agent, you must also sign up with Tavily and provide:

    TAVILY_API_KEY=<your_tavily_api_key>
    • The study assistant also routes to this agent for web searches.
  • Set the logging level using:

    export LANGCHAIN_STUDY_LOG_LEVEL=DEBUG
  • The HR assistant can perform look-ups on a Pinecone vector DB. To use this feature, provide your API key and index name like so:

    PINECONE_API_KEY=<your_api_key>
    PINECONE_INDEX_NAME=langchain-study
    • For the study_assistant, the langchain docs are stored in a separate namespace within Pinecone.
    • Effectively, three separate namespaces are created in the Pinecone index set in the .env file: "repo-readme-<<today_date_stamp_in_iso_format>>", "study-assistant-batch-<<today_date_stamp_in_iso_format>>" and "hr-policies-<<today_date_stamp_in_iso_format>>".
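The date-stamped namespace names above can be generated like so (a minimal sketch; `make_namespace` is an illustrative helper, not a function from this repo):

```python
from datetime import date

def make_namespace(prefix: str) -> str:
    """Append today's ISO-format date stamp to a namespace prefix."""
    return f"{prefix}-{date.today().isoformat()}"

# e.g. "hr-policies-2025-01-15" (the date part depends on the current day)
hr_namespace = make_namespace("hr-policies")
```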

How to run and use assistants

HR Assistant

A basic HR assistant. For now, it can:

  • apply for time-off requests for "nimo@ibm.com"
  • explain company policy, sample query: "Can you tell me the company policy on data privacy?"

Run the following commands in a Python (>=3.12) environment:

pip install -e .
python -m assistants.hr_assistant.hr_policies_ingestor
python -m assistants.hr_assistant.main

Expected Output

(screenshot: hr assistant)

Search Assistant

A ReAct search agent with structured outputs that can summarize search results, powered by TavilySearch.

Run the following commands in a Python (>=3.12) environment:

pip install -e .
python -m assistants.search_assistant.main

Expected Output

(screenshot: react agent)

Study Assistant

A study assistant with structured outputs. It can perform the following operations:

  • help navigate the langchain-study repo ("how-to") instructions
  • help study langchain!
  • Note: If ingesting langchain documents leads to errors in the free tier, my langchain notes should provide some basic insight at a minimum.

Run the following commands in a Python (>=3.12) environment:

pip install -e .
pip install tqdm
python -m assistants.study_assistant.study_material_ingestor
python -m assistants.study_assistant.main

Expected Output

(screenshots: study agent)

Langchain Notes

  • A langchain chain is a sequence of operations where the O/P of one step is the I/P to the next step.

    • A "step" in a chain can be anything: an LLM call, some data transformation, a tool call, another chain, etc.
  • Most langchain components like langchain.prompts.PromptTemplate, etc. implement the Runnable interface.

    • All Runnable types expose an invoke(..) method.
    • Runnable types can be piped using LangChain Expression Language(LCEL) to define a RunnableSequence (this is a "chain"):
      prompt_template = PromptTemplate(...)
      llm = ChatOllama(...)
      
      chain = prompt_template | llm
      response = chain.invoke({...})  # pass the dict of input variables
      • In this example, the Runnable sequence comprises a PromptTemplate in the first step. The invoke(...) method implemented for a PromptTemplate type accepts a dictionary of input_variables to be rendered into the prompt.
      • Alternatively, this rendering of template variables can also be done directly by calling the format(..) method:
        prompt_template_string = PromptTemplate(...).format(**input_variables)
      • The invoke(...) method implementation for the PromptTemplate is defined in one of its parent interfaces called the BasePromptTemplate which initializes a RunnableConfig and returns a PromptValue (the formatted prompt).
      • You can also use the invoke(...) method directly:
        prompt_template_string = PromptTemplate(...).invoke(input_variables).to_string()
        
    • A RunnableSequence type when "invoked" using the invoke(...) method:
      • accepts the args expected by the first Runnable in the sequence.
      • returns the value of the last Runnable type's invoke(...).
      • In the above example, the first Runnable is a PromptTemplate type which expects a dict of input_variables and, the last Runnable is a ChatOllama type which returns an AIMessage response object.
      • Langchain is able to adapt the O/P of one Runnable to the expected I/P format of the next in a RunnableSequence.
  • The langchain hub object offers access to community langchain resources like prompts, etc.

    from langchain import hub
  • LangChain offers a create_react_agent(...) method which provides a clean interface for building LangChain ReAct agents.

    from langchain.agents.react.agent import create_react_agent
    • create_react_agent(...) returns a Runnable which is referred to as a "LangChain ReAct agent".
  • For actual execution of tool calls, langchain provides an AgentExecutor as the agent runtime.

    from langchain.agents import AgentExecutor
  • For implementing tools in langchain, it offers a @tool decorator which can be imported like so:

    from langchain_core.tools import tool
    
    @tool
    def multiply_two_integers(
        a: int,
        b: int,
    ) -> int:
        """
        Returns the result of a*b.
        """
        return a*b
    • LangChain will extract a schema for this "tool" for agent use:
      • name
      • description
      • expected args
    • The model itself doesn't execute a tool; this is done by the langchain backend. The model only produces the arguments needed for the call.
  • The RunnableLambda wrapper lets you turn any Python function (usually a lambda function) into a langchain Runnable.

    • Basically, it makes the invoke(..) method call available on that function.
    • Since it can be "invoked", it can be chained into a RunnableSequence using LCEL.
    from langchain_core.runnables import RunnableLambda
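To make the piping mechanics concrete, here is a minimal pure-Python sketch of the idea (a toy stand-in, not langchain itself): each step exposes invoke(..), and `|` composes steps into a sequence whose invoke(..) threads the output of one step into the next.

```python
class SimpleRunnable:
    """Toy stand-in for langchain's Runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Compose: the output of self.invoke becomes the input of other.invoke.
        return SimpleRunnable(lambda value: other.invoke(self.invoke(value)))


# A "prompt template" step followed by a fake "llm" step.
prompt = SimpleRunnable(lambda vars: f"Tell me about {vars['topic']}.")
fake_llm = SimpleRunnable(lambda text: f"LLM says: {text}")

chain = prompt | fake_llm
print(chain.invoke({"topic": "LCEL"}))  # -> LLM says: Tell me about LCEL.
```

The first step accepts a dict of input variables and the last step's return value is what invoke(..) on the whole chain returns, mirroring the RunnableSequence behaviour described above.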

Markdown Document Handling

  • For chunking documents with markdown formatting, langchain offers some useful resources:
    from langchain_text_splitters import (
        MarkdownHeaderTextSplitter
    )
    • A MarkdownHeaderTextSplitter is used in conjunction with some loading mechanism which preserves the original markdown formatting to read the document as a string.
    • For example, you can use Python's built-in file handling:
      with open(md_file_path, "r") as f:
          document: str = f.read()
    • NOTE: These tools can be used to load a markdown file as a langchain Document, but they do not preserve the original markdown formatting and hence cannot be used with the MarkdownHeaderTextSplitter. They can still be useful, however:
      from langchain_core.documents import Document
      from langchain_community.document_loaders import UnstructuredMarkdownLoader
      
      document_loader = UnstructuredMarkdownLoader(
          file_path=md_file_path,
          mode="single",
          # for chunking the document at this step, you can also pass the mode = "elements".
      )
      document: list[Document] = list(document_loader.lazy_load())
      document_content: str = document[0].page_content # only 1 chunk (the entire document).
      document_metadata: dict = document[0].metadata # only contains a "source" key which maps to the path this document was loaded from.
      • For mode="single": the entire document is loaded as-is without chunking (1 item in the list[Document]).
      • For mode="elements": the entire document is loaded as chunks, broken using the internal parsing logic into categories such as: Text, NarrativeText and UncategorizedText (multiple items in the list[Document]).
  • Once the document has been loaded, it can be split by the headers in the following manner:
    # Step 1
    # Initialize splitter
    document_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
                ("##", "Section"),
                ("###", "Policy"),
        ],
        strip_headers=False,
    )
    # Step 2
    # Split document into chunks
    chunked_document: list[Document] = document_splitter.split_text(document)
    • The strip_headers flag can be used to control whether the actual header split on will be included in the .page_content field of the different Document chunks:
      # when strip_headers = False
      Chunk 1
      ### Policy 1 <-- this text is not included if strip_headers = True
      ....
      
      Chunk 2
      ### Policy 2
      ....
    • The name field of the list of tuples to split on is set as a key in the .metadata field of each Document chunk:
      Chunk 1
      {
          'Section': 'Header starting with ##',
          'Policy': 'Policy 1'
      }
    • In this case, the lowest-level header has three hashtags ("###"); therefore, len(chunked_document) = number of headers at the ### level.
    • Headers with a higher level (# and ##) are included in the first chunk (assuming the remaining document only has headers at the level ###).
  • For additional chunking, you can use langchain_text_splitters.CharacterTextSplitter or, alternatively, langchain_text_splitters.RecursiveCharacterTextSplitter (recommended) for more intelligent parsing of text; the RecursiveCharacterTextSplitter enforces chunk_size more strictly because it uses fallback delimiters when a chunk exceeds the given limit.
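As a sanity check on the header-splitting behaviour described above, here is a minimal pure-Python sketch (not the langchain implementation) that splits a markdown string on "###" headers and records the header text as chunk metadata. Note that, unlike the real splitter, this sketch drops any text before the first "###" header:

```python
def split_on_h3(markdown_text: str) -> list[dict]:
    """Split markdown into chunks at '### ' headers, keeping the header in page_content
    (the strip_headers=False behaviour)."""
    chunks, current = [], None
    for line in markdown_text.splitlines():
        if line.startswith("### "):
            if current:
                chunks.append(current)
            # The header name is recorded as metadata, like the ("###", "Policy") tuple.
            current = {"page_content": line, "metadata": {"Policy": line[4:]}}
        elif current:
            current["page_content"] += "\n" + line
    if current:
        chunks.append(current)
    return chunks


doc = "## Policies\n### Policy 1\ntext a\n### Policy 2\ntext b"
chunks = split_on_h3(doc)
print(len(chunks))  # -> 2, one chunk per ### header
print(chunks[0]["metadata"])  # -> {'Policy': 'Policy 1'}
```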

ReAct architecture

  • A ReAct styled agent implements the following loop for query resolution:
    Thought -> Action & Action Input(s) -> Observation -> Thought -> (...) -> Final Answer
    
  • Initially, "tool calling" was implemented through text-based prompting, for example:
    You have access to the following tools:
    
    {tools}
    
    Use the following format:
    
    Question: the problem you are trying to solve for.
    Thought: your thought on the appropriate action.
    Action: the recommended action, must be one of {tools}
    Action Input: the input to be provided for executing the action
    Observation: the result of the action
    (this pattern can repeat N times)
    
    When you have resolved the question:
        Thought: I now think I have a final answer
        Final Answer: the final answer to the question
    
    Begin!
    
    Question: {question}
    Thought: {agent_scratch_pad}
    
    • the Action and Action Input were then parsed by the system backend to finally execute the "tool call".
    • The agent_scratch_pad is initially an empty string:
      ""
      • The agent_scratch_pad captures the context of the model's recommended Action, Action Input and Observation thus far.
      • It expands like so:
        Action: <tool_name>
        Action Input: <tool_args>
        Observation: <tool_output>
        (...)
        Action: <tool_name>
        Action Input: <tool_args>
        Observation: <tool_output>
        Thought:
        
        # the `Thought:` is always an empty placeholder prompt appended after the last `Observation` (at the end).
      • So the prompt itself would go from: first iteration
        <aforementioned ReAct instructions>
        
        (...)
        
        Begin!
        
        Question: <user_query>
        Thought:
        n-th iteration
        <aforementioned ReAct instructions>
        
        (...)
        
        Begin!
        
        Action: <tool_name>
        Action Input: <tool_args>
        Observation: <tool_output>
        (...)
        Action: <tool_name>
        Action Input: <tool_args>
        Observation: <tool_output>
        Thought:
    • To support structured LLM outputs, langchain provides output_parsers that work with pydantic data models to generate expected formats:
      from langchain_core.output_parsers.pydantic import PydanticOutputParser
      
      output_parser = PydanticOutputParser(
          pydantic_object=<your_data_model>
      )
      # this is only an output parser.
      
      prompt_template_with_format_instructions = PromptTemplate(...).partial(
          <input_variable_used_as_placeholder_for_instructions_in_text_template>=output_parser.get_format_instructions()
      )
      # the expectation set by `get_format_instructions()` is: the LLM will generate an output in a JSON format
      • The .partial(<input_variable>=<variable_value>,...) is useful for returning a "partially formatted" PromptTemplate => this is useful when you have some variable values beforehand but not all.
      • REMEMBER: this will return a PromptTemplate while .format(...) returns a string.
    • Using "pydantic parsing" in the aforementioned fashion is not favoured because we are relying on the LLM to comply with the formatting requirements on its own.
      • Depending on the model size (if a smaller model like gpt-3 is used), this might not always work and will often lead to parsing errors.
      • This is a prompt-level/text-based enforcement of the expected output format and not very robust.
      • How to build the AgentExecutor from scratch?
      • This can be done using a simple while loop, expanding the {agent_scratchpad} with subsequent conversation turns and parsing the output at each agent step to execute tools.
      • Previously, you would have to supply a "Stop Token" like so:
        llm = ChatOpenAI(
            **kwargs,
            stop=[
                "...",
                "\nObservation:", # for example
                "...", # can pass any number of "stop tokens"
            ]
        )
        • The stop= argument defines a list of tokens at which point the LLM stops generating further text.
        • Setting this to "\nObservation:" will cause:
          ❌ Without stop sequence:
              Thought: I need to search for this
              Action: Wikipedia
              Action Input: "Python programming"
              Observation: Python is a high-level language... [HALLUCINATED!]
              Thought: Based on that... [reasoning based on made-up data]
          
          ✅ With stop=["\nObservation:"]:
              Thought: I need to search for this
              Action: Wikipedia  
              Action Input: "Python programming"
              [STOPS HERE - agent executes tool]
              Observation: [REAL Wikipedia result inserted by framework]  
              [Feed back to LLM for next iteration]
        • The "stop token" itself is not included in the response returned by the LLM; generation stops at the token just before it.
        • The problem with this approach was untimely and unexpected parsing issues, which further supported the move to native function calling.
      • The langchain framework would then parse the: "\nAction:" and "\nAction Input:" recommended by the LLM and then execute it to continue the ReAct loop.
        • Langchain offers built-in parsing which you can use to build your own ReAct loop; the parser will output either of these two objects based on the LLM's response:

          • AgentAction : if the model recommends an \nAction: with an \nAction Input:
          • AgentFinish : if the model declares that it knows the "final answer".

          In the backend, all of this is done through RegEx parsing.

          This is useful for individual agent step parsing vis-à-vis the final output parsing discussed so far.
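The while-loop executor described above can be sketched in plain Python (the regex patterns and the stubbed model below are illustrative assumptions, not the langchain implementation):

```python
import re

def react_loop(llm, tools: dict, question: str, max_steps: int = 5) -> str:
    """Minimal ReAct executor: call the model, parse Action/Action Input,
    run the tool, and expand the scratchpad with the Observation."""
    scratchpad = ""
    for _ in range(max_steps):
        output = llm(question, scratchpad)
        final = re.search(r"Final Answer:\s*(.*)", output)
        if final:  # AgentFinish: the model declared a final answer.
            return final.group(1)
        # AgentAction: parse the recommended tool call and execute it.
        action = re.search(r"Action:\s*(\w+)", output).group(1)
        action_input = re.search(r"Action Input:\s*(.*)", output).group(1)
        observation = tools[action](action_input)
        scratchpad += f"{output}\nObservation: {observation}\nThought:"
    return "max steps reached"


# Stubbed model: searches once, then answers from the observation.
def fake_llm(question, scratchpad):
    if "Observation:" not in scratchpad:
        return "Thought: I should search.\nAction: search\nAction Input: python"
    return "Thought: I know the answer.\nFinal Answer: Python is a language."

answer = react_loop(fake_llm, {"search": lambda q: f"results for {q}"}, "What is Python?")
print(answer)  # -> Python is a language.
```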

  • The landscape has since evolved: modern LLMs support native function calling.
    • As an extension to the above discussion of "text-based" tool-calling, there is now a more modern and favoured approach for models which support this feature:
      llm = ChatGroq(model="openai/gpt-oss-120b")
      structured_llm = llm.with_structured_output(
          <your_pydantic_data_model>
      )
      • Models which support native function/tool calling can be chained with the .with_structured_output(..) call for formatting the response returned.
      • Now, you can:
        • remove the explicit "formatting_instructions" (saving input tokens) passed via text to the "regular" llm.
        • use the structured_llm to format the output response using the simple text response returned by the "regular" llm.
      • This is also preferred because:
        • easier to use (plug and play parsing)
        • higher reliability
        • all modern state of the art models support native function calling
      • .with_structured_output(..) falls back to using output parsers when native function calling isn't supported by the model!
    • Having native tool calling support implies that the model provider offers a tools= argument through their own API.
    • For production-grade scaling of LLM applications, this is the recommended practice: rely on the model provider to give you structured JSON outputs through their native function calling capabilities.
    • The only drawback of using a vendor-provided function calling capability is that the reasoning process of the LLM is a black box; however, the tool selection process is considered reliable enough to depend on.
      • The models are fine-tuned by the vendors to be able to select the right tool for a task.
      • However the reasoning process itself is a blackbox from a debugging perspective.
      • For fine-grained control: use Chain of Thought prompting with few-shot examples.
      • If you want the vendor to do the "heavy lifting" of tool selection then use native function calling.
      • The LLM doesn't actually make the tool call. The onus is still on you to take the args returned by the vendor and actually execute the tool.
    • langchain provides a unified interface for using vendor provided native tool calling capabilities of LLMs.
      • ChatModel.bind_tools() to attach tool definitions to model calls.
      • Then the AIMessage returned by the model would have an AIMessage.tool_calls which is an attribute for easily accessing the tool calls the model has decided to make.
  • "Agents" in the langchain ecosystem have evolved in the following manner:
    Text-based tool calling through prompt itself ->
    Native function call supported LLMs as agents ->
    LangGraph ReAct agents for production demands.
    

Chain of Thought Prompting

  • The first evolution of prompt engineering came from the observation that LLMs produce different (better-fitting for a use case) outputs depending on the number of examples supplied:
    • Zero-shot prompting (no examples)
      User: Describe a natural scene for me
    • One-shot prompting (1 example)
      User: Describe a natural scene for me, here are some adjectives to use in your description:
          vivid, sunny, fresh
    • Few-shot prompting (> 1 example)
  • Then came CoT prompting, which proved the hypothesis that typing out the "reasoning" in a question-answer style in the query helps produce better/more accurate outputs:
    • Zero-shot CoT prompting:
      User: How would you learn maths from scratch?
      Let's think through this step by step.
      
    • Few-shot CoT prompting:
      User: Q: How would you learn math from scratch?
      A: I would hire a tutor, go to university and pursue a degree in maths.
      
      Q: How would you learn english from scratch?
      A: I would converse in english more often, read more literature and study etymology so I can correlate english origins to my native tongue.
      
      Q: How would you learn history from scratch?
      A: 
      
  • Combining CoT prompting with the ability to execute/act on that reasoning led to the ReAct architecture, thus using LLMs as the "reasoning engine".

RAG Notes

  • Problem Statement: When the answer to our question exists inside a specific part of a large document, we need an effective mechanism for breaking up the document into chunks.
    • The entire document as a whole is too big to fit within the token limit.
    • Using multiple API calls to send chunked versions of the document in sequence is also unnecessary since the information needed is condensed in a specific part.
    • There is a solution to this problem and it is called "retrieval augmentation".
  • You can use an embedding model to generate vector embeddings out of a document and store it in a vector database.
    • A user query on the document like:
      Tell me what happens in page 546 of book XYZ?
      
      • is also converted into an embedding vector.
      • now, you can calculate the nearest neighbours of the user query to get relevant chunks of the document stored in the DB.
    • a good embedding model will be able to organize the vectors in the vector space so that semantically similar vectors are close to each other.
    • now, we can simply add the relevant chunks of the document as "context" and send it to our LLM.
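The retrieval step above can be sketched with cosine similarity over toy vectors (the embeddings here are hand-made; a real pipeline would use an embedding model and a vector DB):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1 means semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec: list[float], chunks: list[dict], top_k: int = 1) -> list[str]:
    """Return the top_k document chunks nearest to the query embedding."""
    scored = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in scored[:top_k]]


chunks = [
    {"text": "page 546: the hero returns home", "vec": [0.9, 0.1]},
    {"text": "page 12: introduction",           "vec": [0.1, 0.9]},
]
print(retrieve([0.8, 0.2], chunks))  # -> ['page 546: the hero returns home']
```

The retrieved chunk(s) would then be injected as "context" into the prompt sent to the LLM.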

Reflection vs Reflexion

Reflection Architecture

  • A reflection agent is a heavy duty solution for enhancing the quality of LLM responses.
  • How it works:
    • A "generator" agent gets a user query -> makes a decision:
      • tool call?
      • generate a response in natural language?
    • When the generator agent is ready (after possible tool calls) to output a response -> this is piped to a "critique".
    • The critique makes a decision:
      • is it approved? -> display to the user!
      • can it be improved? -> get criticism -> feed back to the generator.

The graph below shows a reflection cycle that lasts 3 improvement iterations (this can be controlled programmatically using LangGraph or left up to the LLM's decision making).

graph TB
    Start([User Query]) --> Gen1[Generator Agent: First Draft]
    
    Gen1 --> Draft1{Draft Response 1}
    
    Draft1 --> Critic1[Critic Agent: Reviews Draft 1]
    
    Critic1 --> Eval1{Quality Check}
    
    Eval1 -->|Not Approved, Iteration 1| Feedback1[Generate Critique: 'Too technical, lacks examples']
    
    Feedback1 --> Gen2[Generator Agent: Refines with Feedback]
    
    Gen2 --> Draft2{Draft Response 2}
    
    Draft2 --> Critic2[Critic Agent: Reviews Draft 2]
    
    Critic2 --> Eval2{Quality Check}
    
    Eval2 -->|Not Approved, Iteration 2| Feedback2[Generate Critique: 'Better, but needs sources']
    
    Feedback2 --> Gen3[Generator Agent: Final Refinement]
    
    Gen3 --> Draft3{Draft Response 3}
    
    Draft3 --> Critic3[Critic Agent: Final Review]
    
    Critic3 --> Eval3{Quality Check}
    
    Eval3 -->|Approved OR Max Iterations| Output([Final Output to User])
    
    Eval1 -.->|If Approved Early| Output
    Eval2 -.->|If Approved Early| Output
    
    style Gen1 fill:#e1f5ff
    style Gen2 fill:#e1f5ff
    style Gen3 fill:#e1f5ff
    style Critic1 fill:#fff4e1
    style Critic2 fill:#fff4e1
    style Critic3 fill:#fff4e1
    style Output fill:#d4edda
    style Start fill:#f8f9fa
    
    classDef iteration fill:#ffe1e1
    class Feedback1,Feedback2 iteration
    
    note2[🔄 Iteration Loop: Each cycle refines based on immediate critic feedback]
    note3[⚠️ Cost: 6-12 LLM calls for 3-6 iterations]

Key Characteristics:

  • 2 LLM calls per iteration
  • No cross-session memory
  • Expensive but high quality
  • Context limited to current conversation
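The generator/critic cycle above can be sketched as a bounded loop (the generator and critic below are stub functions standing in for the two LLM calls per iteration):

```python
def reflection_loop(generate, critique, query: str, max_iterations: int = 3) -> str:
    """Alternate generator and critic until the draft is approved or the budget runs out."""
    feedback = None
    draft = ""
    for _ in range(max_iterations):
        draft = generate(query, feedback)        # LLM call 1: (re)generate with feedback
        approved, feedback = critique(draft)     # LLM call 2: review the draft
        if approved:
            break
    return draft


# Stub agents: the critic approves once the draft mentions an example.
def generate(query, feedback):
    return "answer with example" if feedback else "answer"

def critique(draft):
    if "example" in draft:
        return True, None
    return False, "Too technical, lacks examples"

print(reflection_loop(generate, critique, "Explain LCEL"))  # -> answer with example
```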

Reflexion/Self-Reflection Architecture

  • Here, we have a persistent "critique bank" (typically a vector index which houses criticisms of the "generator" responses).
  • The prompt for the "generator" has a special section where the criticism is rendered in:
REFLEXION_PROMPT = """You are an assistant that learns from past mistakes.

PAST REFLECTIONS (learn from these):
{reflections}
---

Current Task: {user_query}

Based on past failures above, proceed carefully and avoid similar mistakes."""
  • The vector index is queried at every user turn and semantically similar/relevant "reflections" are used to populate the PAST REFLECTIONS in the prompt!
  • In this case, there's a second component: the "Self-Reflection" LLM.
  • Unlike the critic in Reflection architecture (which reviews EVERY response), the Self-Reflection LLM is only triggered when the Evaluator detects a failure.
  • When triggered, it analyzes WHY the failure occurred and generates a verbal reflection for future reference.
  • The "reflection" is stored in the vector index and becomes available from the next user interaction onward, rather than in an immediate feedback cycle like the Reflection architecture.

Goal: Learn from past failures across multiple user interactions to avoid repeating mistakes.

graph TB
    subgraph Episode1[Episode 1: Initial Failure]
        Start1([User Query 1: 'How to use LangGraph?']) --> Gen1[Generator Agent Attempts Response]
        Gen1 --> Output1[Response: 'LangGraph is for...' HALLUCINATED CONTENT]
        Output1 --> Eval1{Automated Evaluator: Ground Truth Check}
        Eval1 -->|❌ Failure Detected| Reflect1[Self-Reflecting LLM]
        Reflect1 --> Analysis1["Verbal Reflection: 'I hallucinated because I didn't search documentation. Next time, ALWAYS use doc search tool first for technical framework questions.'"]
        Analysis1 --> Store1[(Vector Database: Reflection Storage)]
    end
    
    subgraph Memory[Persistent Memory Layer]
        Store1 --> VectorDB[(Reflection Index:
        • Task patterns
        • Failure modes
        • Learned lessons
        • Corrective strategies)]
    end
    
    subgraph Episode2[Episode 2: Learned Behavior]
        Start2([User Query 2: 'Explain LangGraph nodes']) --> Retrieve[Semantic Search: Find Similar Past Failures]
        VectorDB -.->|Retrieve Relevant Reflections| Retrieve
        Retrieve --> Context["Augmented Prompt: PAST REFLECTIONS: 'For framework questions, search docs first...' CURRENT TASK: 'Explain LangGraph nodes'"]
        Context --> Gen2[Generator Agent WITH Past Lessons]
        Gen2 --> Tool[🔧 Uses Doc Search Tool: Learned from Reflection!]
        Tool --> Output2[Response: Accurate, grounded in docs ✓]
        Output2 --> Eval2{Automated Evaluator}
        Eval2 -->|✅ Success!| NoReflect[No New Reflection Needed]
    end
    
    Store1 -.->|Persists Across Sessions| VectorDB
    
    style Gen1 fill:#ffe1e1
    style Output1 fill:#ffe1e1
    style Reflect1 fill:#fff4e1
    style Analysis1 fill:#fff4e1
    style Store1 fill:#e1e8ff
    style VectorDB fill:#e1e8ff
    style Gen2 fill:#d4edda
    style Output2 fill:#d4edda
    style Tool fill:#d4edda
    style Start1 fill:#f8f9fa
    style Start2 fill:#f8f9fa
    
    note2[🧠 Memory Persists: Reflections stored in vector DB, retrieved by semantic similarity]
    note3[⚡ Efficiency: 1 extra LLM call only on failures, then benefit compounds over time]

Key Characteristics:

  • Single-pass per episode

  • Learns across sessions

  • Builds institutional memory

  • Amortized cost improvement

  • The Automated Evaluator can be:

    • Unit tests (for code generation tasks)
    • Ground truth comparison (for factual questions)
    • Reward signal (for RL-style tasks)
    • Human feedback (less common, but possible)
    • Key point: It's not necessarily an LLM - it's any success/failure check.
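The reflexion flow above can be sketched with a naive keyword-overlap "vector index" (a real implementation would use embeddings and a vector DB like Pinecone; the overlap scoring here is only a stand-in for semantic search):

```python
def retrieve_reflections(store: list[dict], task: str, top_k: int = 2) -> list[str]:
    """Rank stored reflections by word overlap with the current task."""
    task_words = set(task.lower().split())
    scored = sorted(
        store,
        key=lambda r: len(task_words & set(r["task"].lower().split())),
        reverse=True,
    )
    return [r["lesson"] for r in scored[:top_k]]

def build_prompt(store: list[dict], task: str) -> str:
    """Render retrieved reflections into the PAST REFLECTIONS slot of the prompt."""
    reflections = "\n".join(retrieve_reflections(store, task)) or "(none yet)"
    return f"PAST REFLECTIONS (learn from these):\n{reflections}\n---\nCurrent Task: {task}"


# One stored failure lesson from a previous episode.
store = [{"task": "How to use LangGraph?",
          "lesson": "Search the docs before answering framework questions."}]
print(build_prompt(store, "Explain LangGraph nodes"))
```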

Agentic AI Notes

  • Providing links to resources (real and accurate references) for answers is called "grounding".
    • Grounding an agent is important for building trust with the customer.
    • This helps validate that the answer is not a model hallucination.

About

a playground for foundational agentic ai exploration
