- Quick Run
- Launch Frontend UI
- How to setup environment
- How to launch/use assistants
- Langchain Notes
- ReAct Notes
- RAG Notes
- Reflection vs Self Reflection/"Reflexion"
- Agentic AI Notes
For an enhanced learning experience, you can set up Claude Desktop and ask it questions about the README or other repo implementations!
Click here to see how to set up an MCP server that you can integrate with any MCP client!
There's a frontend I have designed with Streamlit for ease of navigation (or you can skip it and just use the CLI instead). Here are the steps to launch it:
Run the following commands in a Python (>=3.12) environment:

```shell
pip install -e .
pip install tqdm
streamlit run frontend/_🙋🏽‍♂️_welcome.py
```

Create a `.env` file at the project root with the following properties:
- Sign into Groq (this is the default inference endpoint provider/model provider) and create a `.env` file with the key: `GROQ_API_KEY=<your_groq_api_key>`
- For tracing with LangSmith, set up a `.env` file at the project root that has the following keys: `LANGSMITH_TRACING=true`, `LANGSMITH_API_KEY=<your_langsmith_api_key>`, `LANGSMITH_PROJECT=langchain-study`
- To use the react/search agent, you must also sign up with Tavily and provide: `TAVILY_API_KEY=<your_tavily_api_key>`
  - The study assistant also routes to this agent for web searches.
- Set the logging level using: `export LANGCHAIN_STUDY_LOG_LEVEL=DEBUG`
- The HR assistant can perform look-ups on a Pinecone vector DB; to utilise this feature, provide your API key and index name like so: `PINECONE_API_KEY=<your_api_key>`, `PINECONE_INDEX_NAME=langchain-study`
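Putting the above together, a complete `.env` at the project root might look like this (all values are placeholders; include only the keys for the features you use):

```shell
# Example .env (placeholder values)
GROQ_API_KEY=<your_groq_api_key>
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=<your_langsmith_api_key>
LANGSMITH_PROJECT=langchain-study
TAVILY_API_KEY=<your_tavily_api_key>
PINECONE_API_KEY=<your_api_key>
PINECONE_INDEX_NAME=langchain-study
```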
- For the `study_assistant`, the langchain docs are stored in a separate namespace within Pinecone.
- Effectively, three separate namespaces are created in the Pinecone index set in the `.env` file: `"repo-readme-<<today_date_stamp_in_iso_format>>"`, `"study-assistant-batch-<<today_date_stamp_in_iso_format>>"` and `"hr-policies-<<today_date_stamp_in_iso_format>>"`.
- A basic HR assistant; for now, it can:
  - apply for time-off requests for `"nimo@ibm.com"`
  - explain company policy, sample query: `"Can you tell me the company policy on data privacy?"`
Run the following commands in a Python (>=3.12) environment:
```shell
pip install -e .
python -m assistants.hr_assistant.hr_policies_ingestor
python -m assistants.hr_assistant.main
```

a ReAct search agent with structured outputs; it can perform summarization of search results, powered by TavilySearch
Run the following commands in a Python (>=3.12) environment:
```shell
pip install -e .
python -m assistants.search_assistant.main
```

a study assistant with structured outputs; it can perform the following operations:
- help navigate the langchain-study repo ("how-to") instructions
- help study langchain!
- Note: If ingesting langchain documents leads to errors in the free tier, my langchain notes should provide some basic insight at a minimum.
Run the following commands in a Python (>=3.12) environment:
```shell
pip install -e .
pip install tqdm
python -m assistants.study_assistant.study_material_ingestor
python -m assistants.study_assistant.main
```

- A langchain `chain` is a sequence of operations where the O/P of one step is the I/P to the next step.
  - A "step" in a chain can be anything: an LLM call, some data transformation, a tool call, another chain, etc.
- Most langchain components, like `langchain.prompts.PromptTemplate`, implement the `Runnable` interface.
  - All `Runnable` types expose an `invoke(..)` method.
  - `Runnable` types can be piped using the LangChain Expression Language (LCEL) to define a `RunnableSequence` (this is a "chain"):

```python
prompt_template = PromptTemplate(...)
llm = ChatOllama(...)
chain = prompt_template | llm
response = chain.invoke({...})  # a dict of input_variables
```
- In this example, the `Runnable` sequence comprises a `PromptTemplate` in the first step. The `invoke(...)` method implemented for a `PromptTemplate` type accepts a dictionary of `input_variables` to be rendered into the prompt.
- Alternatively, this rendering of template variables can also be done directly by calling the `format(..)` method: `prompt_template_string = PromptTemplate(...).format(**input_variables)`
- The `invoke(...)` method implementation for the `PromptTemplate` is defined in one of its parent interfaces called the `BasePromptTemplate`, which initializes a `RunnableConfig` and returns a `PromptValue` (the formatted prompt).
- You can also use the `invoke(...)` method directly: `prompt_template_string = PromptTemplate(...).invoke(input_variables).to_string()`
- A `RunnableSequence` type, when "invoked" using the `invoke(...)` method:
  - accepts the args expected by the first `Runnable` in the sequence.
  - returns the value of the last `Runnable` type's `invoke(...)`.
- In the above example, the first `Runnable` is a `PromptTemplate` type, which expects a `dict` of `input_variables`, and the last `Runnable` is a `ChatOllama` type, which returns an `AIMessage` response object.
- Langchain is able to adapt the O/P of one `Runnable` to the expected I/P format of the next in a `RunnableSequence`.
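The piping mechanics above can be sketched without LangChain at all. This is an illustrative toy (real `Runnable`s also handle configs, batching and streaming), but it shows how `|` composes `invoke(...)` calls:

```python
# Framework-free sketch of the "pipe" idea behind RunnableSequence.
class Runnable:
    def invoke(self, value):
        raise NotImplementedError

    def __or__(self, other):
        # the `|` operator builds a two-step sequence
        return Pipe(self, other)


class Pipe(Runnable):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def invoke(self, value):
        # O/P of the first step becomes the I/P of the next
        return self.second.invoke(self.first.invoke(value))


class Template(Runnable):
    def __init__(self, template):
        self.template = template

    def invoke(self, variables):
        return self.template.format(**variables)


class Upper(Runnable):
    def invoke(self, text):
        return text.upper()


chain = Template("Tell me about {topic}") | Upper()
result = chain.invoke({"topic": "LCEL"})
```

`chain.invoke(...)` accepts the dict expected by the first step and returns the value of the last step, just as described above.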
- The langchain `hub` object offers access to community langchain resources like prompts:

```python
from langchain import hub
```
- LangChain offers a `create_react_agent(...)` method which provides a clean interface for building LangChain ReAct agents:

```python
from langchain.agents.react.agent import create_react_agent
```

  - `create_react_agent(...)` returns a `Runnable`, which is referred to as a "LangChain ReAct agent".
- For actual execution of tool calls, langchain provides an `AgentExecutor` as the agent runtime:

```python
from langchain.agents import AgentExecutor
```
- For implementing tools in langchain, it offers a `@tool` decorator which can be imported like so:

```python
from langchain_core.tools import tool

@tool
def multiply_two_integers(
    a: int,
    b: int,
) -> int:
    """
    Returns the result of a*b.
    """
    return a * b
```

  - LangChain will extract a schema for this "tool" for agent use:
    - name
    - description
    - expected args
  - The model itself doesn't execute a tool; this is done by the langchain backend. The model only produces the arguments needed for the call.
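To illustrate the kind of schema extraction described above, here is a framework-free sketch using Python's `inspect` module. The real `@tool` decorator builds a richer pydantic-based args schema; `extract_tool_schema` is a hypothetical helper for illustration only:

```python
import inspect

def multiply_two_integers(a: int, b: int) -> int:
    """Returns the result of a*b."""
    return a * b

def extract_tool_schema(fn):
    # Hypothetical sketch of deriving name / description / expected args
    # from a plain function, the way a tool decorator might.
    signature = inspect.signature(fn)
    return {
        "name": fn.__name__,                 # tool name
        "description": inspect.getdoc(fn),   # tool description
        "args": {                            # expected args with their types
            name: param.annotation.__name__
            for name, param in signature.parameters.items()
        },
    }

schema = extract_tool_schema(multiply_two_integers)
```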
- The `RunnableLambda` wrapper allows you to cast any Python function (usually a lambda function) into a langchain `Runnable`:

```python
from langchain_core.runnables import RunnableLambda
```

  - Basically, it makes the `invoke(..)` method available on that function.
  - Since it can be "invoked", it can be chained into a `RunnableSequence` using LCEL.
- For chunking documents with markdown formatting, `langchain` offers some useful resources:

```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
)
```
- A `MarkdownHeaderTextSplitter` is used in conjunction with some loading mechanism which preserves the original markdown formatting to read the document as a string.
  - For example, you can use Python's built-in file handling:

```python
with open(md_file_path, "r") as f:
    document: str = f.read()
```

  - NOTE: These tools can be used to load a markdown file as a langchain `Document`, but they do not preserve the original markdown formatting and hence cannot be used in conjunction with the `MarkdownHeaderTextSplitter`; they could still be useful:

```python
from langchain_core.documents import Document
from langchain_community.document_loaders import UnstructuredMarkdownLoader

document_loader = UnstructuredMarkdownLoader(
    file_path=md_file_path,
    mode="single",  # for chunking the document at this step, you can also pass mode="elements".
)
document: list[Document] = list(document_loader.lazy_load())
document_content: str = document[0].page_content  # only 1 chunk (the entire document).
document_metadata: dict = document[0].metadata  # only contains a "source" key which maps to the path this document was loaded from.
```

    - For `mode="single"`: the entire document is loaded as-is without chunking (1 item in the `list[Document]`).
    - For `mode="elements"`: the entire document is loaded as chunks, broken using the internal parsing logic into categories such as: `Text`, `NarrativeText` and `UncategorizedText` (multiple items in the `list[Document]`).
- Once the document has been loaded, it can be split by the headers in the following manner:

```python
# Step 1: initialize the splitter
document_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("##", "Section"),
        ("###", "Policy"),
    ],
    strip_headers=False,
)

# Step 2: split the document into chunks
chunked_document: list[Document] = document_splitter.split_text(document)
```

  - The `strip_headers` flag can be used to control whether the actual header split on will be included in the `.page_content` field of the different `Document` chunks:

```text
# when strip_headers = False
Chunk 1
### Policy 1  <-- this text is not included if strip_headers = True
....
Chunk 2
### Policy 2
....
```
  - The name field of each tuple to split on is set as a key in the `.metadata` field of each `Document` chunk:

```text
Chunk 1
{
    'Section': 'Header starting with ##',
    'Policy': 'Policy 1'
}
```

  - In this case, the lowest "level" header has three hashtags ("###"); therefore, `len(chunked_documents)` = number of headers at the level `###`.
  - Headers with a higher level (`#` and `##`) are included in the first chunk (assuming the remaining document only has headers at the level `###`).
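The header-splitting behaviour described above can be approximated in a few lines of plain Python. This is a toy sketch only; the real `MarkdownHeaderTextSplitter` also records the header values as chunk metadata and handles nested header levels:

```python
# Minimal sketch of header-based splitting on one header level.
def split_on_headers(markdown: str, header_prefix: str = "### "):
    chunks, current = [], []
    for line in markdown.splitlines():
        # start a new chunk whenever a header line is seen
        if line.startswith(header_prefix) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    chunks.append("\n".join(current))
    return chunks

doc = "### Policy 1\nbody 1\n### Policy 2\nbody 2"
chunks = split_on_headers(doc)
```

With only `###` headers present, `len(chunks)` equals the number of `###` headers, matching the behaviour described above.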
- For additional chunking, you can use either `langchain_text_splitters.CharacterTextSplitter` or, alternatively, `langchain_text_splitters.RecursiveCharacterTextSplitter` (recommended) for more intelligent parsing of text; the `RecursiveCharacterTextSplitter` has stricter enforcement of the `chunk_size` since it uses fallback delimiters if a chunk exceeds the given limit.
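The "fallback delimiter" idea behind the `RecursiveCharacterTextSplitter` can be sketched roughly like this (illustrative only; the real splitter also merges small pieces back together and supports `chunk_overlap`):

```python
# Sketch: try the coarsest separator first; recurse with finer
# separators only for pieces that still exceed chunk_size.
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    out = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            out.append(piece)
        else:
            # fallback to the next, finer separator
            out.extend(recursive_split(piece, chunk_size, rest))
    return out

pieces = recursive_split("aaaa bbbb\ncccc dddd", chunk_size=9)
```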
- A `ReAct`-styled agent implements the following loop for query resolution: `Thought -> Action & Action Input(s) -> Observation -> Thought -> (...) -> Final Answer`
- Initially, "tool calling" was implemented through text-based prompting, for example:

```text
You have access to the following tools:
{tools}

Use the following format:

Question: the problem you are trying to solve for.
Thought: your thought on the appropriate action.
Action: the recommended action, must be one of {tools}
Action Input: the input to be provided for executing the action
Observation: the result of the action
(this pattern can repeat N times)

When you have resolved the question:
Thought: I now think I have a final answer
Final Answer: the final answer to the question

Begin!

Question: {question}
Thought: {agent_scratch_pad}
```

  - The `Action` and `Action Input` were then parsed by the system backend to finally execute the "tool call".
  - The `agent_scratch_pad` is initially an empty string: `""`
    - The `agent_scratch_pad` captures the context of the model's recommended `Action`, `Action Input` and `Observation` thus far.
    - It expands like so:

```text
Action: <tool_name>
Action Input: <tool_args>
Observation: <tool_output>
(...)
Action: <tool_name>
Action Input: <tool_args>
Observation: <tool_output>
Thought:  # the `Thought:` is always an empty placeholder appended after the last `Observation` (at the end).
```
  - So the prompt itself would go from, on the first iteration:

```text
<aforementioned ReAct instructions>
(...)
Begin!

Question: <user_query>
Thought:
```

  to, on the n-th iteration:

```text
<aforementioned ReAct instructions>
(...)
Begin!

Action: <tool_name>
Action Input: <tool_args>
Observation: <tool_output>
(...)
Action: <tool_name>
Action Input: <tool_args>
Observation: <tool_output>
Thought:
```
- To support structured LLM outputs, langchain provides `output_parsers` that work with `pydantic` data models to generate expected formats:

```python
from langchain_core.output_parsers.pydantic import PydanticOutputParser

output_parser = PydanticOutputParser(
    pydantic_object=<your_data_model>  # this is only an output parser.
)

prompt_template_with_format_instructions = PromptTemplate(...).partial(
    <input_variable_used_as_placeholder_for_instructions_in_text_template>=output_parser.get_format_instructions()
)
# the expectation set by `get_format_instructions()` is: the LLM will generate an output in a JSON format
```
  - The `.partial(<input_variable>=<variable_value>, ...)` is useful for returning a "partially formatted" `PromptTemplate` => this is useful when you have some variable values beforehand but not all.
    - REMEMBER: this will return a `PromptTemplate` while `.format(...)` returns a string.
- Using "pydantic parsing" in the aforementioned fashion is not favoured because we are expecting the LLM itself to comply with the formatting requirements.
  - Depending on the model size (if a smaller model like `gpt-3` is used), this might not always work and will often lead to parsing errors.
  - This is a prompt-level/text-based enforcement of the expected output format and not very robust.
How to build the `AgentExecutor` from scratch?

- This can be done using a simple while loop, expanding the `{agent_scratchpad}` with subsequent conversation turns and parsing the output at each agent step to execute tools.
- Previously, you would have to supply a "stop token" like so:

```python
llm = ChatOpenAI(
    **kwargs,
    stop=[
        "\nObservation:",  # for example
        # can pass any number of "stop tokens"
    ],
)
```

  - The `stop=` argument defines a list of tokens at which point the LLM stops generating further text.
  - Setting this to `"\nObservation:"` will cause:

```text
❌ Without stop sequence:
Thought: I need to search for this
Action: Wikipedia
Action Input: "Python programming"
Observation: Python is a high-level language... [HALLUCINATED!]
Thought: Based on that... [reasoning based on made-up data]

✅ With stop=["\nObservation:"]:
Thought: I need to search for this
Action: Wikipedia
Action Input: "Python programming"
[STOPS HERE - agent executes tool]
Observation: [REAL Wikipedia result inserted by framework]
[Feed back to LLM for next iteration]
```

  - The "stop token" itself is not included in the response returned by the LLM; generation stops at the token just before it.
  - The problem with this approach was untimely and unexpected parsing issues, which further supported the move to native function calling.
- The `langchain` framework would then parse the `"\nAction:"` and `"\nAction Input:"` recommended by the LLM and then execute it to continue the ReAct loop.
  - Langchain offers built-in parsing which you can use to build your own ReAct loop; the parser will output either of these two objects based on the LLM's response:
    - `AgentAction`: if the model recommends an `\nAction:` with an `\nAction Input:`
    - `AgentFinish`: if the model declares that it knows the "final answer".
  - In the backend, all of this is done through RegEx parsing.
  - This is useful for individual agent step parsing vis-à-vis the final output parsing discussed so far.
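A rough sketch of what such RegEx-based step parsing looks like (illustrative only; the real LangChain parser is stricter about formats and error handling, and returns proper `AgentAction`/`AgentFinish` objects rather than tuples):

```python
import re

# Toy ReAct step parser: returns ("AgentFinish", answer) or
# ("AgentAction", tool_name, tool_args) from raw LLM text.
ACTION_RE = re.compile(
    r"Action:\s*(?P<tool>.+?)\s*Action Input:\s*(?P<args>.+)",
    re.DOTALL,
)

def parse_agent_step(llm_text: str):
    if "Final Answer:" in llm_text:
        return ("AgentFinish", llm_text.split("Final Answer:")[-1].strip())
    match = ACTION_RE.search(llm_text)
    if match:
        return ("AgentAction", match.group("tool").strip(), match.group("args").strip())
    raise ValueError("Could not parse LLM output")
```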
- The landscape has evolved now: modern LLMs support native function calling.
- As an extension to the above discussion of "text-based" tool-calling, there is now a more modern and favoured approach for models which support this feature:
```python
llm = ChatGroq(model="openai/gpt-oss-120b")
structured_llm = llm.with_structured_output(
    <your_pydantic_data_model>
)
```
  - Models which support native function/tool calling can be chained with the `.with_structured_output(..)` call for formatting the response returned.
  - Now, you can:
    - remove the explicit "formatting_instructions" (saving input tokens) passed via text to the "regular" `llm`.
    - use the `structured_llm` to format the output response using the simple text response returned by the "regular" llm.
  - This is also preferred because:
    - easier to use (plug-and-play parsing)
    - higher reliability
    - all modern state-of-the-art models support native function calling
    - `.with_structured_output(..)` falls back to using output parsers when native function calling isn't supported by the model!
  - Having native tool calling support implies that the model provider offers a `tools=` argument through their own API.
  - For production-grade scaling of LLM applications, this is the recommended practice: rely on the model provider to give you structured JSON outputs through their native function calling capabilities.
  - The only drawback of using a vendor-provided function calling capability is that the reasoning process of the LLM is a black box; however, the tool selection process is considered reliable enough to depend on.
    - The models are fine-tuned by the vendors to be able to select the right tool for a task.
    - However, the reasoning process itself is a black box from a debugging perspective.
    - For fine-grained control: use Chain of Thought prompting with few-shot examples.
    - If you want the vendor to do the "heavy lifting" of tool selection, use native function calling.
  - The LLM doesn't actually make the tool call. The onus is still on you to take the args returned by the vendor and actually execute the tool.
  - `langchain` provides a unified interface for using vendor-provided native tool calling capabilities of LLMs: `ChatModel.bind_tools()` attaches tool definitions to model calls.
    - The `AIMessage` returned by the model then has an `AIMessage.tool_calls` attribute for easily accessing the tool calls the model has decided to make.
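To make the "onus is on you" point concrete, here is a toy execution loop over tool calls shaped like `AIMessage.tool_calls` (a list of dicts with `name` and `args` keys); the `TOOLS` registry is a stand-in for your own tool mapping:

```python
# Hypothetical tool registry mapping tool names to callables.
TOOLS = {"multiply_two_integers": lambda a, b: a * b}

def execute_tool_calls(tool_calls):
    # The model only *recommends* calls; we execute them here.
    results = []
    for call in tool_calls:
        fn = TOOLS[call["name"]]
        results.append(fn(**call["args"]))
    return results

results = execute_tool_calls(
    [{"name": "multiply_two_integers", "args": {"a": 6, "b": 7}}]
)
```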
- "Agents" in the langchain ecosystem have evolved in the following manner:
Text-based tool calling through prompt itself -> Native function call supported LLMs as agents -> LangGraph ReAct agents for production demands.
- The first evolution of prompt engineering came from the observation that LLMs produced better-fitting outputs for a use-case when given an appropriate number of examples:
  - Zero-shot prompting (no examples): `User: Describe a natural scene for me`
  - One-shot prompting (1 example): `User: Describe a natural scene for me, here are some adjectives to use in your description: vivid, sunny, fresh`
  - Few-shot prompting (> 1 example)
- Then came CoT prompting, which proved the hypothesis that typing out the "reasoning" in a question-answer style in the query helped produce better/more accurate outputs:
  - Zero-shot CoT prompting:

```text
User: How would you learn maths from scratch? Let's think through this step by step.
```

  - Few-shot CoT prompting:

```text
User:
Q: How would you learn math from scratch?
A: I would hire a tutor, go to university and pursue a degree in maths.
Q: How would you learn english from scratch?
A: I would converse in english more often, read more literature and study etymology so I can correlate english origins to my native tongue.
Q: How would you learn history from scratch?
A:
```
- Combining CoT prompting + the ability to execute/act on that reasoning led to the ReAct architecture, thus using LLMs as the "reasoning engine".
- Problem Statement: When the answer to our question exists inside a specific part of a large document, we need an effective mechanism for breaking up the document into chunks.
- The entire document as a whole is too big to fit within the token limit.
- Using multiple API calls to send chunked versions of the document in sequence is also unnecessary since the information needed is condensed in a specific part.
- There is a solution to this problem and it is called "retrieval augmentation".
- You can use an embedding model to generate vector embeddings out of a document and store it in a vector database.
- A user query on the document, like `Tell me what happens in page 546 of book XYZ?`:
  - is also converted into an embedding vector (in the same vector space).
  - now, you can calculate the nearest neighbours of the user query to get relevant chunks of the document stored in the DB.
  - a good embedding model will be able to organize the vectors in the vector space so that semantically similar vectors are close to each other.
  - now, we can simply add the relevant chunks of the document as "context" and send it to our LLM.
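The nearest-neighbour step can be illustrated with a toy cosine-similarity search; a real pipeline would use an embedding model and a vector DB such as Pinecone, and the vectors below are made up for illustration:

```python
import math

def cosine(a, b):
    # cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector store": chunk name -> (made-up) embedding.
chunks = {
    "page 546 summary": [0.9, 0.1, 0.0],
    "table of contents": [0.1, 0.9, 0.0],
}
query_vec = [0.8, 0.2, 0.0]  # pretend this is the embedded user query

# Retrieve the most semantically similar chunk as "context" for the LLM.
best = max(chunks, key=lambda name: cosine(query_vec, chunks[name]))
```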
- A reflection agent is a heavy duty solution for enhancing the quality of LLM responses.
- How it works:
- A "generator" agent gets a user query -> makes a decision:
- tool call?
- generate a response in natural language?
- When the generator agent is ready (after possible tool calls) to output a response -> this is piped to a "critic".
- The critic makes a decision:
- is it approved? -> display to the user!
- can it be improved? -> get criticism -> feed back to the generator.
- A "generator" agent gets a user query -> makes a decision:
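The decision flow above can be sketched as a toy loop where the generator and critic are stub functions standing in for LLM calls:

```python
# Toy reflection loop; `generator` and `critic` are stubs for LLM calls.
def generator(query, feedback=None):
    base = f"answer to: {query}"
    # a real generator would incorporate the critique into a new draft
    return base + " (revised)" if feedback else base

def critic(draft):
    # stub quality check: approve only revised drafts in this toy example;
    # returns (approved, criticism)
    if "revised" in draft:
        return (True, None)
    return (False, "add more detail")

def reflection_loop(query, max_iterations=3):
    feedback = None
    for _ in range(max_iterations):
        draft = generator(query, feedback)
        approved, feedback = critic(draft)
        if approved:
            return draft  # display to the user
    return draft  # fall back to the last draft at max iterations

result = reflection_loop("what is LCEL?")
```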
The graph below shows a reflection cycle that lasts 3 improvement iterations (this can be controlled programmatically using LangGraph or left up to the LLM's decision making).
```mermaid
graph TB
    Start([User Query]) --> Gen1[Generator Agent<br/>First Draft]
    Gen1 --> Draft1{Draft Response 1}
    Draft1 --> Critic1[Critic Agent<br/>Reviews Draft 1]
    Critic1 --> Eval1{Quality Check}
    Eval1 -->|Not Approved<br/>Iteration 1| Feedback1[Generate Critique:<br/>'Too technical, lacks examples']
    Feedback1 --> Gen2[Generator Agent<br/>Refines with Feedback]
    Gen2 --> Draft2{Draft Response 2}
    Draft2 --> Critic2[Critic Agent<br/>Reviews Draft 2]
    Critic2 --> Eval2{Quality Check}
    Eval2 -->|Not Approved<br/>Iteration 2| Feedback2[Generate Critique:<br/>'Better, but needs sources']
    Feedback2 --> Gen3[Generator Agent<br/>Final Refinement]
    Gen3 --> Draft3{Draft Response 3}
    Draft3 --> Critic3[Critic Agent<br/>Final Review]
    Critic3 --> Eval3{Quality Check}
    Eval3 -->|Approved OR Max Iterations| Output([Final Output to User])
    Eval1 -.->|If Approved Early| Output
    Eval2 -.->|If Approved Early| Output

    style Gen1 fill:#e1f5ff
    style Gen2 fill:#e1f5ff
    style Gen3 fill:#e1f5ff
    style Critic1 fill:#fff4e1
    style Critic2 fill:#fff4e1
    style Critic3 fill:#fff4e1
    style Output fill:#d4edda
    style Start fill:#f8f9fa

    classDef iteration fill:#ffe1e1
    class Feedback1,Feedback2 iteration

    note2[🔄 Iteration Loop:<br/>Each cycle refines based on immediate critic feedback]
    note3[⚠️ Cost: 6-12 LLM calls for 3-6 iterations]
```
Key Characteristics:
- 2 LLM calls per iteration
- No cross-session memory
- Expensive but high quality
- Context limited to current conversation
- Here, we have a persistent "critique bank" (typically, a vector index which houses criticisms of the "generator" responses).
- The prompt for the "generator" has a special section where the criticism is rendered in:

```python
REFLEXION_PROMPT = """You are an assistant that learns from past mistakes.

PAST REFLECTIONS (learn from these):
{reflections}

---

Current Task: {user_query}

Based on past failures above, proceed carefully and avoid similar mistakes."""
```

- The vector index is queried at every user turn and semantically similar/relevant "reflections" are used to populate the `PAST REFLECTIONS` section in the prompt!
PAST REFLECTIONSin the prompt! - In this case, there's a second component: the "Self-Reflection" LLM.
- Unlike the critic in Reflection architecture (which reviews EVERY response), the Self-Reflection LLM is only triggered when the Evaluator detects a failure.
- When triggered, it analyzes WHY the failure occurred and generates a verbal reflection for future reference.
- The "reflection" is stored in the vector index and becomes available in future user interactions (not the current turn).
- The "reflection" is only available in the next turn/next time the user interacts instead of an immediate feedback cycle like in the reflection architecture!
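Rendering retrieved reflections into the generator prompt can be sketched like so; the reflection strings and the toy "bank" are made up, and in practice they would come from a semantic search over the vector index:

```python
REFLEXION_PROMPT = """You are an assistant that learns from past mistakes.

PAST REFLECTIONS (learn from these):
{reflections}

---

Current Task: {user_query}

Based on past failures above, proceed carefully and avoid similar mistakes."""

# Toy stand-in for the vector index query at the start of a user turn.
reflection_bank = [
    "For framework questions, search the docs first.",
    "Cite sources for factual claims.",
]

prompt = REFLEXION_PROMPT.format(
    reflections="\n".join(reflection_bank),
    user_query="Explain LangGraph nodes",
)
```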
Goal: Learn from past failures across multiple user interactions to avoid repeating mistakes.
```mermaid
graph TB
    subgraph Episode1[Episode 1: Initial Failure]
        Start1([User Query 1:<br/>'How to use LangGraph?']) --> Gen1[Generator Agent<br/>Attempts Response]
        Gen1 --> Output1[Response:<br/>'LangGraph is for...'<br/>HALLUCINATED CONTENT]
        Output1 --> Eval1{Automated Evaluator<br/>Ground Truth Check}
        Eval1 -->|❌ Failure Detected| Reflect1[Self-Reflecting LLM]
        Reflect1 --> Analysis1["Verbal Reflection:<br/>'I hallucinated because I didn't<br/>search documentation. Next time,<br/>ALWAYS use doc search tool first<br/>for technical framework questions.'"]
        Analysis1 --> Store1[(Vector Database<br/>Reflection Storage)]
    end

    subgraph Memory[Persistent Memory Layer]
        Store1 --> VectorDB[(Reflection Index:<br/>• Task patterns<br/>• Failure modes<br/>• Learned lessons<br/>• Corrective strategies)]
    end

    subgraph Episode2[Episode 2: Learned Behavior]
        Start2([User Query 2:<br/>'Explain LangGraph nodes']) --> Retrieve[Semantic Search:<br/>Find Similar Past Failures]
        VectorDB -.->|Retrieve Relevant Reflections| Retrieve
        Retrieve --> Context["Augmented Prompt:<br/>PAST REFLECTIONS:<br/>'For framework questions,<br/>search docs first...'<br/>CURRENT TASK:<br/>'Explain LangGraph nodes'"]
        Context --> Gen2[Generator Agent<br/>WITH Past Lessons]
        Gen2 --> Tool[🔧 Uses Doc Search Tool<br/>Learned from Reflection!]
        Tool --> Output2[Response:<br/>Accurate, grounded in docs ✓]
        Output2 --> Eval2{Automated Evaluator}
        Eval2 -->|✅ Success!| NoReflect[No New Reflection Needed]
    end

    Store1 -.->|Persists Across Sessions| VectorDB

    style Gen1 fill:#ffe1e1
    style Output1 fill:#ffe1e1
    style Reflect1 fill:#fff4e1
    style Analysis1 fill:#fff4e1
    style Store1 fill:#e1e8ff
    style VectorDB fill:#e1e8ff
    style Gen2 fill:#d4edda
    style Output2 fill:#d4edda
    style Tool fill:#d4edda
    style Start1 fill:#f8f9fa
    style Start2 fill:#f8f9fa

    note2[🧠 Memory Persists:<br/>Reflections stored in vector DB,<br/>retrieved by semantic similarity]
    note3[⚡ Efficiency: 1 extra LLM call<br/>only on failures, then benefit<br/>compounds over time]
```
Key Characteristics:
- Single-pass per episode
- Learns across sessions
- Builds institutional memory
- Amortized cost improvement
- The Automated Evaluator can be:
  - Unit tests (for code generation tasks)
  - Ground truth comparison (for factual questions)
  - Reward signal (for RL-style tasks)
  - Human feedback (less common, but possible)
  - Key point: It's not necessarily an LLM - it's any success/failure check.
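For instance, a unit-test style evaluator for a code-generation task needs no LLM at all. A toy sketch (using `exec` for illustration only; don't run untrusted model output like this in production):

```python
# Toy automated evaluator: a unit-test style ground-truth check on
# model-generated source code. Returns a pure success/failure signal.
def evaluate_generated_function(source_code: str) -> bool:
    namespace = {}
    try:
        exec(source_code, namespace)
        # ground-truth check: the generated `add` must behave correctly
        return namespace["add"](2, 3) == 5
    except Exception:
        return False  # unparseable or crashing code counts as failure
```

A `False` result here is what would trigger the Self-Reflection LLM in the Reflexion loop above.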
- Providing links to resources (real and accurate references) for answers is called "grounding".
- Grounding an agent is important for building trust with the customer.
- This helps validate that the answer is not a model hallucination.