diff --git a/notebooks/agents/langchain_agent_simple_demo.ipynb b/notebooks/agents/langchain_agent_simple_demo.ipynb deleted file mode 100644 index c3658a07e..000000000 --- a/notebooks/agents/langchain_agent_simple_demo.ipynb +++ /dev/null @@ -1,1074 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "# Simplified LangChain Agent Model Documentation\n", - "\n", - "This notebook demonstrates how to build and validate a simplified AI agent using LangChain's tool calling functionality integrated with ValidMind for comprehensive testing and monitoring.\n", - "\n", - "Learn how to create intelligent agents that can:\n", - "- **Automatically select appropriate tools** based on user queries using LLM-powered tool calling\n", - "- **Handle conversations** with intelligent tool selection\n", - "- **Use two specialized tools** with smart decision-making\n", - "- **Provide validation and testing** through ValidMind integration\n", - "\n", - "We'll build a simplified agent system that intelligently routes user requests to two specialized tools: **search_engine** for document search and **task_assistant** for general assistance, then validate its performance using ValidMind's testing framework.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "## Setup and Imports\n", - "\n", - "First, let's import all the necessary libraries for building our LangChain agent system:\n", - "\n", - "- **LangChain components** for LLM integration and tool management\n", - "- **LangChain tool calling** for intelligent tool selection and execution\n", - "- **ValidMind** for model validation and testing\n", - "- **Standard libraries** for data handling and environment management\n", - "\n", - "The setup includes loading environment variables (like OpenAI API keys) needed for the LLM components to function properly.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q langchain validmind openai" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import Optional, Dict, Any\n", - "from langchain.tools import tool\n", - "from langchain_core.messages import HumanMessage, SystemMessage\n", - "from langchain_openai import ChatOpenAI\n", - "\n", - "# Load environment variables if using .env file\n", - "try:\n", - " from dotenv import load_dotenv\n", - " load_dotenv()\n", - "except ImportError:\n", - " print(\"dotenv not installed. 
Make sure OPENAI_API_KEY is set in your environment.\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import validmind as vm\n", - "\n", - "vm.init(\n", - " api_host=\"...\",\n", - " api_key=\"...\",\n", - " api_secret=\"...\",\n", - " model=\"...\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "## LLM-Powered Tool Selection Router\n", - "\n", - "This section demonstrates how to create an intelligent router that uses an LLM to select the most appropriate tool based on user input and tool docstrings.\n", - "\n", - "### Benefits of LLM-Based Tool Selection:\n", - "- **Intelligent Routing**: Understanding of natural language intent\n", - "- **Dynamic Selection**: Can handle complex, multi-step requests \n", - "- **Context Awareness**: Considers conversation history and context\n", - "- **Flexible Matching**: Not limited to keyword patterns\n", - "- **Tool Documentation**: Uses actual tool docstrings for decision making\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Simplified Tools with Rich Docstrings\n", - "\n", - "We've simplified the agent to use only two core tools:\n", - "- **search_engine**: For searching through documents, policies, and knowledge base \n", - "- **task_assistant**: For general-purpose task assistance and problem-solving\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Search Engine Tool\n", - "@tool\n", - "def search_engine(query: str, document_type: Optional[str] = \"all\") -> str:\n", - " \"\"\"\n", - " Search through internal documents, policies, and knowledge base.\n", - " \n", - " This tool can search for:\n", - " - Company policies and procedures\n", - " - Technical documentation and manuals\n", - " - Compliance and regulatory documents\n", - " - Historical records and reports\n", - " - Product specifications and requirements\n", - " - Legal documents and contracts\n", - " \n", - " Args:\n", - " query (str): Search terms or questions about documents\n", - " document_type (str, optional): Type of document to search (\"policy\", \"technical\", \"legal\", \"all\")\n", - " \n", - " Returns:\n", - " str: Relevant document excerpts and references\n", - " \n", - " Examples:\n", - " - \"Find our data privacy policy\"\n", - " - \"Search for loan approval procedures\"\n", - " - \"What are the security guidelines for API access?\"\n", - " - \"Show me compliance requirements for financial reporting\"\n", - " \"\"\"\n", - " document_db = {\n", - " \"policy\": [\n", - " \"Data Privacy Policy: All personal data must be encrypted...\",\n", - " \"Remote Work Policy: Employees may work remotely up to 3 days...\",\n", - " \"Security Policy: All systems require multi-factor authentication...\"\n", - " ],\n", - " \"technical\": [\n", - " \"API Documentation: REST endpoints available at /api/v1/...\",\n", - " \"Database Schema: User table contains id, name, email...\",\n", - " \"Deployment Guide: Use Docker containers with Kubernetes...\"\n", - " ],\n", - " \"legal\": [\n", - " \"Terms of Service: By using this service, you agree to...\",\n", - " \"Privacy Notice: We collect information to provide services...\",\n", - " \"Compliance Framework: SOX requirements mandate quarterly audits...\"\n", - " ]\n", - " }\n", - " \n", - " results = []\n", - " search_types = [document_type] if document_type != \"all\" else 
document_db.keys()\n", - " \n", - " for doc_type in search_types:\n", - " if doc_type in document_db:\n", - " for doc in document_db[doc_type]:\n", - " if any(term.lower() in doc.lower() for term in query.split()):\n", - " results.append(f\"[{doc_type.upper()}] {doc}\")\n", - " \n", - " if not results:\n", - " results.append(f\"No documents found matching '{query}'\")\n", - " \n", - " return \"\\n\\n\".join(results)\n", - "\n", - "# Task Assistant Tool\n", - "@tool\n", - "def task_assistant(task_description: str, context: Optional[str] = None) -> str:\n", - " \"\"\"\n", - " General-purpose task assistance and problem-solving tool.\n", - " \n", - " This tool can help with:\n", - " - Breaking down complex tasks into steps\n", - " - Providing guidance and recommendations\n", - " - Answering questions and explaining concepts\n", - " - Suggesting solutions to problems\n", - " - Planning and organizing activities\n", - " - Research and information gathering\n", - " \n", - " Args:\n", - " task_description (str): Description of the task or question\n", - " context (str, optional): Additional context or background information\n", - " \n", - " Returns:\n", - " str: Helpful guidance, steps, or information for the task\n", - " \n", - " Examples:\n", - " - \"How do I prepare for a job interview?\"\n", - " - \"What are the steps to deploy a web application?\"\n", - " - \"Help me plan a team meeting agenda\"\n", - " - \"Explain machine learning concepts for beginners\"\n", - " \"\"\"\n", - " responses = {\n", - " \"meeting\": \"For planning meetings: 1) Define objectives, 2) Create agenda, 3) Invite participants, 4) Prepare materials, 5) Set time limits\",\n", - " \"interview\": \"Interview preparation: 1) Research the company, 2) Practice common questions, 3) Prepare examples, 4) Plan your outfit, 5) Arrive early\",\n", - " \"deploy\": \"Deployment steps: 1) Test in staging, 2) Backup production, 3) Deploy code, 4) Run health checks, 5) Monitor performance\",\n", - " \"learning\": \"Learning approach: 1) Start with basics, 2) Practice regularly, 3) Build projects, 4) Join communities, 5) Stay updated\"\n", - " }\n", - " \n", - " task_lower = task_description.lower()\n", - " for key, response in responses.items():\n", - " if key in task_lower:\n", - " return f\"Task assistance for '{task_description}':\\n\\n{response}\"\n", - " \n", - " \n", - " return f\"\"\"For the task '{task_description}', I recommend: 1) Break it into smaller steps, 2) Gather necessary resources, 3)\n", - " Create a timeline, 4) Start with the most critical parts, 5) Review and adjust as needed.\n", - " \"\"\"\n", - "\n", - "# Collect all tools for the LLM router - SIMPLIFIED TO ONLY 2 TOOLS\n", - "AVAILABLE_TOOLS = [\n", - " search_engine,\n", - " task_assistant\n", - "]\n", - "\n", - "print(\"Simplified tools created!\")\n", - "print(f\"Available tools: {len(AVAILABLE_TOOLS)}\")\n", - "for tool in AVAILABLE_TOOLS:\n", - " print(f\" - {tool.name}: {tool.description[:50]}...\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Complete LangChain Agent with Tool Calling\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def create_intelligent_langchain_agent():\n", - " \"\"\"Create a simplified LangChain agent with direct tool calling.\"\"\"\n", - " \n", - " # Initialize the main LLM for responses\n", - " llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0.7)\n", - " \n", - " # Bind tools to the LLM\n", - " llm_with_tools = 
llm.bind_tools(AVAILABLE_TOOLS)\n", - " \n", - " # Enhanced system prompt with tool selection guidance\n", - " system_prompt = \"\"\"You are a helpful AI assistant with access to specialized tools. Analyze the user's request and directly use the most appropriate tools to help them.\n", - "\n", - " AVAILABLE TOOLS:\n", - " 🔍 **search_engine** - Search through internal documents, policies, and knowledge base\n", - " - Use for: finding company policies, technical documentation, compliance documents\n", - " - Examples: \"Find our data privacy policy\", \"Search for API documentation\"\n", - "\n", - " **task_assistant** - General-purpose task assistance and problem-solving \n", - " - Use for: guidance, recommendations, explaining concepts, planning activities\n", - " - Examples: \"How to prepare for an interview\", \"Help plan a meeting\", \"Explain machine learning\"\n", - "\n", - " INSTRUCTIONS:\n", - " - Analyze the user's request carefully\n", - " - If they need to find documents/policies → use search_engine\n", - " - If they need general help/guidance/explanations → use task_assistant \n", - " - If the request needs specific information search, use search_engine first\n", - " - You can use tools directly based on the user's needs\n", - " - Provide helpful, accurate responses based on tool outputs\n", - " - If no tools are needed, respond conversationally\n", - "\n", - " Choose and use tools wisely to provide the most helpful response.\"\"\"\n", - "\n", - " def invoke_agent(user_input: str, session_id: str = \"default\") -> Dict[str, Any]:\n", - " \"\"\"Invoke the agent with tool calling support.\"\"\"\n", - " \n", - " # Create conversation with system prompt\n", - " messages = [\n", - " SystemMessage(content=system_prompt),\n", - " HumanMessage(content=user_input)\n", - " ]\n", - " \n", - " # Get initial response from LLM\n", - " response = llm_with_tools.invoke(messages)\n", - " messages.append(response)\n", - " tools_used = []\n", - " # Check if the LLM wants to use tools\n", - " if hasattr(response, 'tool_calls') and response.tool_calls:\n", - " # Execute tool calls\n", - " for tool_call in response.tool_calls:\n", - " # Find the matching tool\n", - " tool_to_call = None\n", - " for tool in AVAILABLE_TOOLS:\n", - " if tool.name == tool_call['name']:\n", - " tool_to_call = tool\n", - " tools_used.append(tool_to_call.name)\n", - " break\n", - " \n", - " if tool_to_call:\n", - " # Execute the tool\n", - " try:\n", - "\n", - " tool_result = tool_to_call.invoke(tool_call['args'])\n", - " # Add tool message to conversation\n", - " from langchain_core.messages import ToolMessage\n", - " messages.append(ToolMessage(\n", - " content=str(tool_result),\n", - " tool_call_id=tool_call['id']\n", - " ))\n", - " except Exception as e:\n", - " messages.append(ToolMessage(\n", - " content=f\"Error executing tool {tool_call['name']}: {str(e)}\",\n", - " tool_call_id=tool_call['id']\n", - " ))\n", - " \n", - " # Get final response after tool execution\n", - " final_response = llm.invoke(messages)\n", - " messages.append(final_response)\n", - " \n", - " return {\n", - " \"messages\": messages,\n", - " \"user_input\": user_input,\n", - " \"session_id\": session_id,\n", - " \"context\": {},\n", - " \"tools_used\": tools_used\n", - " }\n", - " \n", - " return invoke_agent\n", - "\n", - "# Create the simplified intelligent agent\n", - "intelligent_agent = create_intelligent_langchain_agent()\n", - "\n", - "print(\"Simplified LangChain Agent Created!\")\n", - "print(\"Features:\")\n", - "print(\" - Direct LLM 
tool calling (native LangChain functionality)\")\n", - "print(\" - Enhanced system prompt for intelligent tool choice\")\n", - "print(\" - Simple workflow: LLM -> Tools -> Final Response\")\n", - "print(\" - Automatic tool parameter extraction\")\n", - "print(\" - Clean, simplified architecture\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ValidMind Model Integration\n", - "\n", - "Now we'll integrate our LangChain agent with ValidMind for comprehensive testing and validation. This step is crucial for:\n", - "\n", - "**Model Wrapping**: We create a wrapper function (`agent_fn`) that standardizes the agent interface for ValidMind\n", - "- **Input Formatting**: Converts ValidMind inputs to the agent's expected format\n", - "- **Session Management**: Handles conversation threads and session tracking\n", - "- **Result Processing**: Returns agent responses in a consistent format\n", - "\n", - "**ValidMind Agent Initialization**: Using `vm.init_model()` creates a ValidMind model object that:\n", - "- **Enables Testing**: Allows us to run validation tests on the agent\n", - "- **Tracks Performance**: Monitors agent behavior and responses \n", - "- **Provides Documentation**: Generates documentation and analysis reports\n", - "- **Supports Evaluation**: Enables quantitative assessment of agent capabilities\n", - "\n", - "This integration allows us to treat our LangChain agent like any other machine learning model in the ValidMind ecosystem, enabling comprehensive testing and validation workflows." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def agent_fn(input):\n", - " \"\"\"\n", - " Invoke the simplified agent with the given input.\n", - " \"\"\"\n", - " user_input = input[\"input\"]\n", - " session_id = input[\"session_id\"]\n", - " \n", - " # Invoke the agent with the user input\n", - " result = intelligent_agent(user_input, session_id)\n", - " \n", - " return {\"prediction\": result['messages'][-1].content, \"output\": result, \"tools_used\": result['tools_used']}\n", - "\n", - "\n", - "vm_intelligent_model = vm.init_model(input_id=\"financial_model\", predict_fn=agent_fn)\n", - "# add model to the vm agent - store the agent function\n", - "vm_intelligent_model.model = intelligent_agent" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prepare Sample Test Dataset\n", - "\n", - "We'll create a comprehensive test dataset to evaluate our agent's performance across different scenarios. 
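Before assembling the full dataset, it can help to smoke-test the wrapper on a single input and confirm the return shape ValidMind will consume. A minimal sketch, reusing the `agent_fn` defined above (the sample query is illustrative):

```python
import uuid

# One-off smoke test of the ValidMind wrapper before batch evaluation.
sample = {"input": "Find our data privacy policy", "session_id": str(uuid.uuid4())}
out = agent_fn(sample)

print(out["prediction"][:200])  # final agent answer (truncated for display)
print(out["tools_used"])        # e.g. ["search_engine"] if routing worked as intended
```
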
This dataset includes:\n", - "\n", - "**Diverse Test Cases**: Various types of user requests that test different agent capabilities:\n", - "- **Single Tool Requests**: Simple queries that require one specific tool\n", - "- **Multi-Tool Requests**: Complex queries requiring multiple tools in sequence \n", - "- **Validation Tasks**: Requests for data validation and verification\n", - "- **General Assistance**: Open-ended questions for problem-solving guidance\n", - "\n", - "**Expected Outputs**: For each test case, we define:\n", - "- **Expected Tools**: Which tools should be selected by the router\n", - "- **Possible Outputs**: Valid response patterns or values\n", - "- **Session IDs**: Unique identifiers for conversation tracking\n", - "\n", - "**Test Coverage**: The dataset covers:\n", - "- Document retrieval (search_engine tool)\n", - "- General guidance (task_assistant tool)\n", - "\n", - "This structured approach allows us to systematically evaluate both tool selection accuracy and response quality." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import uuid\n", - "\n", - "# Simplified test dataset with only search_engine and task_assistant tools\n", - "test_dataset = pd.DataFrame([\n", - " {\n", - " \"input\": \"Find our company's data privacy policy\",\n", - " \"expected_tools\": [\"search_engine\"],\n", - " \"possible_outputs\": [\"privacy_policy.pdf\", \"data_protection.doc\", \"company_privacy_guidelines.txt\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"Search for loan approval procedures\", \n", - " \"expected_tools\": [\"search_engine\"],\n", - " \"possible_outputs\": [\"loan_procedures.doc\", \"approval_process.pdf\", \"lending_guidelines.txt\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"How should I prepare for a technical interview?\",\n", - " \"expected_tools\": [\"task_assistant\"],\n", - " \"possible_outputs\": [\"algorithms\", \"data structures\", \"system design\", \"coding practice\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"Help me understand machine learning basics\",\n", - " \"expected_tools\": [\"task_assistant\"],\n", - " \"possible_outputs\": [\"supervised\", \"unsupervised\", \"neural networks\", \"training\", \"testing\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"What can you do for me?\",\n", - " \"expected_tools\": [\"task_assistant\"],\n", - " \"possible_outputs\": [\"search documents\", \"provide assistance\", \"answer questions\", \"help with tasks\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"Find technical documentation about API endpoints\",\n", - " \"expected_tools\": [\"search_engine\"],\n", - " \"possible_outputs\": [\"API_documentation.pdf\", \"REST_endpoints.doc\", \"technical_guide.txt\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"Help me plan a team meeting agenda\",\n", - " \"expected_tools\": [\"task_assistant\"],\n", - " \"possible_outputs\": [\"objectives\", \"agenda\", \"participants\", \"materials\", \"time limits\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " }\n", - "])\n", - "\n", - "print(\"Simplified test dataset created!\")\n", - "print(f\"Number of test cases: {len(test_dataset)}\")\n", - "print(f\"Test tools: {test_dataset['expected_tools'].explode().unique()}\")\n" - ] - }, - { - 
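Since every `expected_tools` entry must name a real tool, a quick consistency check against `AVAILABLE_TOOLS` catches typos before any expensive agent runs. A small optional guard, assuming the objects defined earlier in this notebook:

```python
# Optional guard: every expected tool in the dataset must exist on the agent.
valid_names = {t.name for t in AVAILABLE_TOOLS}
referenced = set(test_dataset["expected_tools"].explode())
unknown = referenced - valid_names
assert not unknown, f"Test dataset references unknown tools: {unknown}"
```
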
"cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Display the simplified test dataset\n", - "print(\"Using simplified test dataset with only 2 tools:\")\n", - "print(f\"Number of test cases: {len(test_dataset)}\")\n", - "print(f\"Available tools being tested: {sorted(test_dataset['expected_tools'].explode().unique())}\")\n", - "print(\"\\nTest cases preview:\")\n", - "for i, row in test_dataset.iterrows():\n", - " print(f\"{i+1}. {row['input']} -> Expected tool: {row['expected_tools'][0]}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Initialize ValidMind Dataset\n", - "\n", - "Before we can run tests and evaluations, we need to initialize our test dataset as a ValidMind dataset object. This process:\n", - "\n", - "**Dataset Registration**: Creates a ValidMind dataset object that can be used in testing workflows\n", - "- **Input Identification**: Assigns a unique `input_id` for tracking and reference\n", - "- **Target Column Definition**: Specifies which column contains expected outputs for evaluation\n", - "- **Metadata Preservation**: Maintains all dataset information and structure\n", - "\n", - "**Testing Preparation**: The initialized dataset enables:\n", - "- **Systematic Evaluation**: Consistent testing across all data points\n", - "- **Performance Tracking**: Monitoring of agent responses and accuracy\n", - "- **Result Documentation**: Automatic generation of test reports and metrics\n", - "- **Comparison Analysis**: Benchmarking against expected outputs\n", - "\n", - "This step is essential for integrating our agent evaluation into ValidMind's comprehensive testing and validation framework.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset = vm.init_dataset(\n", - " input_id=\"test_dataset\",\n", - " dataset=test_dataset,\n", - " target_column=\"possible_outputs\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Run Agent and Assign Predictions\n", - "\n", - "Now we'll execute our agent on the test dataset and capture its responses for evaluation. This step:\n", - "\n", - "**Agent Execution**: Runs the agent on each test case in our dataset\n", - "- **Automatic Processing**: Iterates through all test inputs systematically\n", - "- **Response Capture**: Records complete agent responses including tool calls and outputs\n", - "- **Session Management**: Maintains separate conversation threads for each test case\n", - "- **Error Handling**: Gracefully manages any execution failures or timeouts\n", - "\n", - "**Prediction Assignment**: Links agent responses to the dataset for analysis\n", - "- **Response Mapping**: Associates each input with its corresponding agent output \n", - "- **Metadata Preservation**: Maintains conversation state, tool calls, and routing decisions\n", - "- **Format Standardization**: Ensures responses are in a consistent format for evaluation\n", - "\n", - "This process generates the prediction data needed for comprehensive performance evaluation and comparison against expected outputs." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_predictions(vm_intelligent_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Dataframe display settings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pd.set_option('display.max_colwidth', 40)\n", - "pd.set_option('display.width', 120)\n", - "pd.set_option('display.max_colwidth', None)\n", - "vm_test_dataset._df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Visualization\n", - "\n", - "This test validates and documents the LangChain agent's structure and capabilities:\n", - "- Verifies proper agent function configuration\n", - "- Documents available tools and their descriptions\n", - "- Validates core agent functionality and architecture\n", - "- Returns detailed agent information and test results \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "@vm.test(\"my_custom_tests.LangChainAgentInfo\")\n", - "def LangChainAgentInfo(model):\n", - " \"\"\"\n", - " Provides information about the LangChain agent structure and capabilities.\n", - " \n", - " ### Purpose\n", - " Documents the LangChain agent's architecture and available tools to validate\n", - " that the agent is properly configured with the expected functionality.\n", - " \n", - " ### Test Mechanism\n", - " 1. Validates that the model has the expected agent function\n", - " 2. Documents the available tools and their capabilities\n", - " 3. Returns agent information and validation results\n", - " \n", - " ### Signs of High Risk\n", - " - Missing agent function indicates setup issues\n", - " - Incorrect number of tools or missing expected tools\n", - " - Agent function not callable\n", - " \"\"\"\n", - " try:\n", - " # Check if model has the agent function\n", - " if not hasattr(model, 'model') or not callable(model.model):\n", - " return {\n", - " 'test_results': False,\n", - " 'summary': {\n", - " 'status': 'FAIL', \n", - " 'details': 'Model must have a callable agent function as model attribute'\n", - " }\n", - " }\n", - " \n", - " # Document agent capabilities\n", - " agent_info = {\n", - " 'agent_type': 'LangChain Tool Calling Agent',\n", - " 'available_tools': [tool.name for tool in AVAILABLE_TOOLS],\n", - " 'tool_descriptions': {tool.name: tool.description for tool in AVAILABLE_TOOLS},\n", - " 'architecture': 'LLM with bound tools -> Tool execution -> Final response',\n", - " 'features': [\n", - " 'Direct LLM tool calling',\n", - " 'Enhanced system prompt for tool selection',\n", - " 'Simple workflow execution',\n", - " 'Automatic tool parameter extraction'\n", - " ]\n", - " }\n", - " \n", - " return {\n", - " 'agent_info': agent_info\n", - " }\n", - " \n", - " except Exception as e:\n", - " return {\n", - " 'test_results': False, \n", - " 'summary': {\n", - " 'status': 'FAIL',\n", - " 'details': f'Failed to analyze agent structure: {str(e)}'\n", - " }\n", - " }\n", - "\n", - "vm.tests.run_test(\n", - " \"my_custom_tests.LangChainAgentInfo\",\n", - " inputs = {\n", - " \"model\": vm_intelligent_model\n", - " }\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Accuracy Test\n", - "\n", - "The purpose of this test is to evaluate the agent's ability to provide accurate responses by:\n", - "- Testing against a dataset of predefined questions and 
expected answers\n", - "- Checking if responses contain expected keywords\n", - "- Providing detailed test results including pass/fail status\n", - "- Helping identify any gaps in the agent's knowledge or response quality" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import validmind as vm\n", - "\n", - "@vm.test(\"my_custom_tests.accuracy_test\")\n", - "def accuracy_test(model, dataset, list_of_columns):\n", - " \"\"\"\n", - " Run tests on a dataset of questions and expected responses.\n", - " Each row passes if the agent's response contains at least one expected keyword.\n", - " \"\"\"\n", - " df = dataset._df\n", - " \n", - " # Pre-compute expected keywords and predictions for all rows\n", - " y_true = dataset.y.tolist()\n", - " y_pred = dataset.y_pred(model).tolist()\n", - "\n", - " # Per-row keyword-containment check\n", - " test_results = []\n", - " for response, keywords in zip(y_pred, y_true):\n", - " test_results.append(any(str(keyword).lower() in str(response).lower() for keyword in keywords))\n", - " \n", - " results = pd.DataFrame()\n", - " column_names = [col + \"_details\" for col in list_of_columns]\n", - " results[column_names] = df[list_of_columns]\n", - " results[\"actual\"] = y_pred\n", - " results[\"expected\"] = y_true\n", - " results[\"passed\"] = test_results\n", - " # Record a per-row error message for failed cases\n", - " results[\"error\"] = [\n", - " None if passed else f'Response did not contain any expected keywords: {keywords}'\n", - " for passed, keywords in zip(test_results, y_true)\n", - " ]\n", - " \n", - " return results\n", - " \n", - "result = vm.tests.run_test(\n", - " \"my_custom_tests.accuracy_test\",\n", - " inputs={\n", - " \"dataset\": vm_test_dataset,\n", - " \"model\": vm_intelligent_model\n", - " },\n", - " params={\n", - " \"list_of_columns\": [\"input\"]\n", - " }\n", - ")\n", - "result.log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Tool Call Accuracy Test\n", - "\n", - "This test evaluates how accurately our intelligent router selects the correct tools for different user requests. It's a critical validation step that measures:\n", - "\n", - "**Tool Selection Performance**: Analyzes whether the agent correctly identifies and calls the expected tools\n", - "- **Expected vs. Actual**: Compares tools that should be called with tools that were actually called\n", - "- **Accuracy Scoring**: Calculates percentage accuracy for tool selection decisions\n", - "- **Multi-tool Handling**: Evaluates performance on requests requiring multiple tools\n", - "\n", - "**Router Intelligence Assessment**: Validates the LLM-powered routing system's effectiveness\n", - "- **Intent Recognition**: How well the router understands user intent from natural language\n", - "- **Tool Mapping**: Accuracy of mapping user needs to appropriate tool capabilities\n", - "- **Decision Quality**: Assessment of routing confidence and reasoning\n", - "\n", - "**Failure Analysis**: Identifies patterns in incorrect tool selections to improve the routing logic\n", - "- **Missed Tools**: Cases where expected tools weren't selected\n", - "- **Extra Tools**: Cases where unnecessary tools were selected\n", - "- **Wrong Tools**: Cases where completely incorrect tools were selected\n", - "\n", - "This test provides quantitative feedback on the agent's core intelligence: its ability to understand what users need and select the right tools to help them."
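To make the failure-analysis categories above concrete, a small helper (a sketch added here, separate from the ValidMind test below) can split one comparison into matched, missed, and extra tools:

```python
def compare_tool_sets(expected: list, actual: list) -> dict:
    """Break a tool-selection comparison into matched / missed / extra tools."""
    exp, act = set(expected), set(actual)
    return {
        "matched": sorted(exp & act),
        "missed": sorted(exp - act),   # expected but never called
        "extra": sorted(act - exp),    # called but not expected
        "recall": len(exp & act) / len(exp) if exp else 1.0,
    }

print(compare_tool_sets(["search_engine"], ["search_engine", "task_assistant"]))
# -> {'matched': ['search_engine'], 'missed': [], 'extra': ['task_assistant'], 'recall': 1.0}
```
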
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import validmind as vm\n", - "\n", - "# Test with a real LangChain agent result instead of creating mock objects\n", - "@vm.test(\"my_custom_tests.tool_call_accuracy\")\n", - "def tool_call_accuracy(dataset, agent_output_column, expected_tools_column):\n", - " \"\"\"Test validation using actual LangChain agent results.\"\"\"\n", - " # Let's create a simpler validation without the complex RAGAS setup\n", - " def validate_tool_calls_simple(messages, expected_tools):\n", - " \"\"\"Simple validation of tool calls without RAGAS dependency issues.\"\"\"\n", - " \n", - " tool_calls_found = []\n", - " \n", - " for message in messages:\n", - " if hasattr(message, 'tool_calls') and message.tool_calls:\n", - " for tool_call in message.tool_calls:\n", - " # Handle both dictionary and object formats\n", - " if isinstance(tool_call, dict):\n", - " tool_calls_found.append(tool_call['name'])\n", - " else:\n", - " # ToolCall object - use attribute access\n", - " tool_calls_found.append(tool_call.name)\n", - " \n", - " # Check if expected tools were called\n", - " accuracy = 0.0\n", - " matches = 0\n", - " if expected_tools:\n", - " matches = sum(1 for tool in expected_tools if tool in tool_calls_found)\n", - " accuracy = matches / len(expected_tools)\n", - " \n", - " return {\n", - " 'accuracy': accuracy,\n", - " 'expected_tools': expected_tools,\n", - " 'found_tools': tool_calls_found,\n", - " 'matches': matches,\n", - " 'total_expected': len(expected_tools) if expected_tools else 0\n", - " }\n", - "\n", - " df = dataset._df\n", - " \n", - " results = []\n", - " for i, row in df.iterrows():\n", - " result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])\n", - " results.append(result)\n", - " \n", - " return results\n", - "\n", - "vm.tests.run_test(\n", - " \"my_custom_tests.tool_call_accuracy\",\n", - " inputs = {\n", - " \"dataset\": vm_test_dataset,\n", - " },\n", - " params = {\n", - " \"agent_output_column\": \"output\",\n", - " \"expected_tools_column\": \"expected_tools\"\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## RAGAS Tests for Agent Evaluation\n", - "\n", - "RAGAS (Retrieval-Augmented Generation Assessment) provides specialized metrics for evaluating conversational AI systems like our LangChain agent. These tests analyze different aspects of agent performance:\n", - "\n", - "**Why RAGAS for Agents**: Our agent uses tools to retrieve information (documents, task assistance) and generates responses based on that context, making it similar to a RAG system. 
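Concretely, a single agent run maps onto the RAG evaluation triple along these lines (field names and values here are illustrative, not a required schema):

```python
# Illustrative mapping of one agent run onto the RAG evaluation triple.
ragas_row = {
    "user_input": "Find our data privacy policy",        # original query
    "retrieved_contexts": ["Data Privacy Policy: ..."],  # tool outputs, treated as context
    "response": "Our data privacy policy requires ...",  # final agent answer
}
```
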
RAGAS metrics help evaluate:\n", - "\n", - "- **Response Quality**: How well the agent uses retrieved tool outputs to generate helpful responses\n", - "- **Information Faithfulness**: Whether agent responses accurately reflect tool outputs \n", - "- **Relevance Assessment**: How well responses address the original user query\n", - "- **Context Utilization**: How effectively the agent incorporates tool results into final answers\n", - "\n", - "**Test Preparation**: We extract tool outputs as \"context\" for RAGAS evaluation:\n", - "- **Tool Message Extraction**: Capture outputs from search_engine and task_assistant tools\n", - "- **Context Mapping**: Treat tool results as retrieved context for evaluation\n", - "- **Response Analysis**: Evaluate final agent responses against both user input and tool context\n", - "\n", - "These tests provide insights into how well our agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to users.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Dataset Preparation - Extract Context from Agent State\n", - "\n", - "Before running RAGAS tests, we need to extract and prepare the context information from our agent's execution results. This process:\n", - "\n", - "**Tool Output Extraction**: Retrieves the outputs from tools used during agent execution\n", - "- **Message Parsing**: Analyzes the agent's conversation state to find tool outputs\n", - "- **Content Aggregation**: Combines outputs from multiple tools when used in sequence\n", - "- **Context Formatting**: Structures tool outputs as context for RAGAS evaluation\n", - "\n", - "**RAGAS Format Preparation**: Converts agent data into the format expected by RAGAS metrics\n", - "- **User Input**: Original user queries from the test dataset\n", - "- **Retrieved Context**: Tool outputs treated as \"retrieved\" information \n", - "- **Agent Response**: Final responses generated by the agent\n", - "- **Ground Truth**: Expected outputs for comparison\n", - "\n", - "This preparation step is essential because RAGAS metrics were designed for traditional RAG systems, so we need to map our agent's tool-based architecture to the RAG paradigm for meaningful evaluation. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from notebooks.agents.langchain_utils import capture_tool_output_messages\n", - "\n", - "tool_messages = []\n", - "for i, row in vm_test_dataset._df.iterrows():\n", - " tool_message = \"\"\n", - " # Print messages in a readable format\n", - " result = row['output']\n", - " # Capture all tool outputs and metadata\n", - " captured_data = capture_tool_output_messages(result)\n", - " \n", - " # Access specific tool outputs\n", - " for output in captured_data[\"tool_outputs\"]:\n", - " tool_message += output['content']\n", - " tool_messages.append([tool_message])\n", - "\n", - "vm_test_dataset._df['tool_messages'] = tool_messages" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset._df.head(2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Faithfulness\n", - "\n", - "Faithfulness measures how accurately the agent's responses reflect the information retrieved from tools. 
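For intuition only, a naive lexical proxy gives the flavor of the metric; RAGAS itself decomposes the response into claims and judges each one with an LLM, which this toy function does not do:

```python
def naive_faithfulness(response: str, contexts: list) -> float:
    """Toy proxy: fraction of response sentences sharing >= 3 words with the context."""
    context_words = set(" ".join(contexts).lower().split())
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    supported = sum(1 for s in sentences if len(set(s.lower().split()) & context_words) >= 3)
    return supported / len(sentences) if sentences else 0.0
```
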
This metric evaluates:\n", - "\n", - "**Information Accuracy**: Whether the agent correctly uses tool outputs in its responses\n", - "- **Fact Preservation**: Ensuring numerical results, weather data, and document content are accurately reported\n", - "- **No Hallucination**: Verifying the agent doesn't invent information not provided by tools\n", - "- **Source Attribution**: Checking that responses align with actual tool outputs\n", - "\n", - "**Critical for Agent Trust**: Faithfulness is essential for agent reliability because users need to trust that:\n", - "- Calculator results are reported correctly\n", - "- Weather information is accurate \n", - "- Document searches return real information\n", - "- Validation results are properly communicated" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.Faithfulness\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"response_column\": [\"financial_model_prediction\"],\n", - " \"retrieved_contexts_column\": [\"tool_messages\"],\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Response Relevancy\n", - "\n", - "Response Relevancy evaluates how well the agent's answers address the user's original question or request. This metric assesses:\n", - "\n", - "**Query Alignment**: Whether responses directly answer what users asked for\n", - "- **Intent Fulfillment**: Checking if the agent understood and addressed the user's actual need\n", - "- **Completeness**: Ensuring responses provide sufficient information to satisfy the query\n", - "- **Focus**: Avoiding irrelevant information that doesn't help the user\n", - "\n", - "**Conversational Quality**: Measures the agent's ability to maintain relevant, helpful dialogue\n", - "- **Context Awareness**: Responses should be appropriate for the conversation context\n", - "- **User Satisfaction**: Answers should be useful and actionable for the user\n", - "- **Clarity**: Information should be presented in a way that directly helps the user\n", - "\n", - "High relevancy indicates the agent successfully understands user needs and provides targeted, helpful responses." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.ResponseRelevancy\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " params={\n", - " \"user_input_column\": \"input\",\n", - " \"response_column\": \"financial_model_prediction\",\n", - " \"retrieved_contexts_column\": \"tool_messages\",\n", - " }\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Context Recall\n", - "\n", - "Context Recall measures how well the agent utilizes the information retrieved from tools when generating its responses. 
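Directionally it is the inverse of the faithfulness check: instead of asking whether the response stays inside the context, it asks how much of the reference answer the context covers. A toy proxy in the same spirit as the sketch above, again far simpler than RAGAS's claim-level judging:

```python
def naive_context_recall(reference: str, contexts: list) -> float:
    """Toy proxy: fraction of (non-trivial) reference words found anywhere in the contexts."""
    ref_words = [w for w in reference.lower().split() if len(w) > 3]
    context_text = " ".join(contexts).lower()
    hits = sum(1 for w in ref_words if w in context_text)
    return hits / len(ref_words) if ref_words else 0.0
```
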
This metric evaluates:\n", - "\n", - "**Information Utilization**: Whether the agent effectively incorporates tool outputs into its responses\n", - "- **Coverage**: How much of the available tool information is used in the response\n", - "- **Integration**: How well tool outputs are woven into coherent, natural responses\n", - "- **Completeness**: Whether all relevant information from tools is considered\n", - "\n", - "**Tool Effectiveness**: Assesses whether selected tools provide useful context for responses\n", - "- **Relevance**: Whether tool outputs actually help answer the user's question\n", - "- **Sufficiency**: Whether enough information was retrieved to generate good responses\n", - "- **Quality**: Whether the tools provided accurate, helpful information\n", - "\n", - "High context recall indicates the agent not only selects the right tools but also effectively uses their outputs to create comprehensive, well-informed responses." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.ContextRecall\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"retrieved_contexts_column\": [\"tool_messages\"],\n", - " \"reference_column\": [\"financial_model_prediction\"],\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### AspectCritic\n", - "\n", - "AspectCritic provides comprehensive evaluation across multiple dimensions of agent performance. This metric analyzes various aspects of response quality:\n", - "\n", - "**Multi-Dimensional Assessment**: Evaluates responses across different quality criteria\n", - "- **Helpfulness**: Whether responses genuinely assist users in accomplishing their goals\n", - "- **Relevance**: How well responses address the specific user query\n", - "- **Coherence**: Whether responses are logically structured and easy to follow\n", - "- **Correctness**: Accuracy of information and appropriateness of recommendations\n", - "\n", - "**Holistic Quality Scoring**: Provides an overall assessment that considers:\n", - "- **User Experience**: How satisfying and useful the interaction would be for real users\n", - "- **Professional Standards**: Whether responses meet quality expectations for production systems\n", - "- **Consistency**: Whether the agent maintains quality across different types of requests\n", - "\n", - "AspectCritic helps identify specific areas where the agent excels or needs improvement, providing actionable insights for enhancing overall performance and user satisfaction." 
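Under the hood, each aspect is a natural-language yes/no rubric applied by a judge LLM. Purely as an illustration of the idea (these names and definitions are hypothetical; consult the RAGAS/ValidMind documentation for how aspects are actually configured):

```python
# Illustrative rubrics only; not the test's actual configuration.
aspects = {
    "helpfulness": "Does the response genuinely help the user accomplish their stated goal?",
    "coherence": "Is the response logically structured and easy to follow?",
    "correctness": "Are the factual statements consistent with the tool outputs provided?",
}
```
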
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.AspectCritic\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"response_column\": [\"financial_model_prediction\"],\n", - " \"retrieved_contexts_column\": [\"tool_messages\"],\n", - " },\n", - ").log()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "ValidMind Library", - "language": "python", - "name": "validmind" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/notebooks/agents/langchain_utils.py b/notebooks/agents/langchain_utils.py deleted file mode 100644 index e10954f28..000000000 --- a/notebooks/agents/langchain_utils.py +++ /dev/null @@ -1,29 +0,0 @@ -from typing import Dict, Any -from langchain_core.messages import ToolMessage - - -def capture_tool_output_messages(agent_result: Dict[str, Any]) -> Dict[str, Any]: - """ - Capture all tool outputs and metadata from agent results. - - Args: - agent_result: The result from the LangChain agent execution - Returns: - Dictionary containing tool outputs and metadata - """ - messages = agent_result.get('messages', []) - tool_outputs = [] - - for message in messages: - if isinstance(message, ToolMessage): - tool_outputs.append({ - 'tool_name': 'unknown', # ToolMessage doesn't directly contain tool name - 'content': message.content, - 'tool_call_id': getattr(message, 'tool_call_id', None) - }) - - return { - 'tool_outputs': tool_outputs, - 'total_messages': len(messages), - 'tool_message_count': len(tool_outputs) - } diff --git a/notebooks/agents/langgraph_agent_demo.ipynb b/notebooks/agents/langgraph_agent_demo.ipynb deleted file mode 100644 index 009369840..000000000 --- a/notebooks/agents/langgraph_agent_demo.ipynb +++ /dev/null @@ -1,1488 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "# LangGraph Agent Model Documentation\n", - "\n", - "This notebook demonstrates how to build and validate sophisticated AI agents using LangGraph integrated with ValidMind for comprehensive testing and monitoring.\n", - "\n", - "Learn how to create intelligent agents that can:\n", - "- **Automatically select appropriate tools** based on user queries using LLM-powered routing\n", - "- **Manage complex workflows** with state management and memory\n", - "- **Handle multiple tools conditionally** with smart decision-making\n", - "- **Provide validation and testing** through ValidMind integration\n", - "\n", - "We'll build a complete agent system that intelligently routes user requests to specialized tools like calculators, weather services, document search, and validation tools, then validate its performance using ValidMind's testing framework.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "## Setup and Imports\n", - "\n", - "First, let's import all the necessary libraries for building our LangGraph agent system:\n", - "\n", - "- **LangChain components** for LLM integration and tool management\n", - "- **LangGraph** for building stateful, multi-step agent workflows 
\n", - "- **ValidMind** for model validation and testing\n", - "- **Standard libraries** for data handling and environment management\n", - "\n", - "The setup includes loading environment variables (like OpenAI API keys) needed for the LLM components to function properly.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q langgraph langchain validmind openai" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import TypedDict, List, Annotated, Sequence, Optional, Dict, Any\n", - "from langchain.tools import tool\n", - "from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage\n", - "from langchain_openai import ChatOpenAI\n", - "from langgraph.graph import StateGraph, END, START\n", - "from langgraph.prebuilt import ToolNode\n", - "from langgraph.checkpoint.memory import MemorySaver\n", - "from langgraph.graph.message import add_messages\n", - "import json\n", - "\n", - "# Load environment variables if using .env file\n", - "try:\n", - " from dotenv import load_dotenv\n", - " load_dotenv()\n", - "except ImportError:\n", - " print(\"dotenv not installed. Make sure OPENAI_API_KEY is set in your environment.\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import validmind as vm\n", - "\n", - "vm.init(\n", - " api_host=\"...\",\n", - " api_key=\"...\",\n", - " api_secret=\"...\",\n", - " model=\"...\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "## LLM-Powered Tool Selection Router\n", - "\n", - "This section demonstrates how to create an intelligent router that uses an LLM to select the most appropriate tool based on user input and tool docstrings.\n", - "\n", - "### Benefits of LLM-Based Tool Selection:\n", - "- **Intelligent Routing**: Understanding of natural language intent\n", - "- **Dynamic Selection**: Can handle complex, multi-step requests \n", - "- **Context Awareness**: Considers conversation history and context\n", - "- **Flexible Matching**: Not limited to keyword patterns\n", - "- **Tool Documentation**: Uses actual tool docstrings for decision making\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Enhanced Tools with Rich Docstrings\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Advanced Calculator Tool\n", - "@tool\n", - "def advanced_calculator(expression: str) -> str:\n", - " \"\"\"\n", - " Perform mathematical calculations and solve arithmetic expressions.\n", - " \n", - " This tool can handle:\n", - " - Basic arithmetic: addition (+), subtraction (-), multiplication (*), division (/)\n", - " - Mathematical functions: sqrt, sin, cos, tan, log, exp\n", - " - Constants: pi, e\n", - " - Parentheses for order of operations\n", - " - Decimal numbers and scientific notation\n", - " \n", - " Args:\n", - " expression (str): Mathematical expression to evaluate (e.g., \"2 + 3 * 4\", \"sqrt(16)\", \"sin(pi/2)\")\n", - " \n", - " Returns:\n", - " str: Result of the calculation or error message\n", - " \n", - " Examples:\n", - " - \"Calculate 15 * 7 + 23\"\n", - " - \"What is the square root of 144?\"\n", - " - \"Solve 2^8\"\n", - " - \"What's 25% of 200?\"\n", - " \"\"\"\n", - " import math\n", - " import re\n", - " \n", - " try:\n", - " # 
Sanitize and evaluate safely\n", - " safe_expression = expression.replace('^', '**') # Handle exponents\n", - " safe_expression = re.sub(r'[^0-9+\\-*/().,\\s]', '', safe_expression)\n", - " \n", - " # Add math functions\n", - " safe_dict = {\n", - " \"__builtins__\": {},\n", - " \"sqrt\": math.sqrt,\n", - " \"sin\": math.sin,\n", - " \"cos\": math.cos,\n", - " \"tan\": math.tan,\n", - " \"log\": math.log,\n", - " \"exp\": math.exp,\n", - " \"pi\": math.pi,\n", - " \"e\": math.e,\n", - " }\n", - " \n", - " result = eval(safe_expression, safe_dict)\n", - " return f\"The result is: {result}\"\n", - " except Exception as e:\n", - " return f\"Error calculating '{expression}': {str(e)}\"\n", - "\n", - "# Weather Service Tool\n", - "@tool\n", - "def weather_service(location: str, forecast_days: Optional[int] = 1) -> str:\n", - " \"\"\"\n", - " Get current weather conditions and forecasts for any city worldwide.\n", - " \n", - " This tool provides:\n", - " - Current temperature, humidity, and weather conditions\n", - " - Multi-day weather forecasts (up to 7 days)\n", - " - Weather alerts and warnings\n", - " - Historical weather data\n", - " - Seasonal weather patterns\n", - " \n", - " Args:\n", - " location (str): City name, coordinates, or location identifier\n", - " forecast_days (int, optional): Number of forecast days (1-7). Defaults to 1.\n", - " \n", - " Returns:\n", - " str: Weather information for the specified location\n", - " \n", - " Examples:\n", - " - \"What's the weather in Tokyo?\"\n", - " - \"Give me a 3-day forecast for London\"\n", - " - \"Is it going to rain in New York tomorrow?\"\n", - " - \"What's the temperature in Paris right now?\"\n", - " \"\"\"\n", - " import random\n", - " \n", - " conditions = [\"sunny\", \"cloudy\", \"partly cloudy\", \"rainy\", \"stormy\", \"snowy\"]\n", - " temp = random.randint(-10, 35)\n", - " condition = random.choice(conditions)\n", - " \n", - " forecast = f\"Weather in {location}:\\n\"\n", - " forecast += f\"Current: {condition}, {temp}°C\\n\"\n", - " \n", - " if forecast_days > 1:\n", - " forecast += f\"\\n{forecast_days}-day forecast:\\n\"\n", - " for day in range(1, forecast_days + 1):\n", - " day_temp = temp + random.randint(-5, 5)\n", - " day_condition = random.choice(conditions)\n", - " forecast += f\"Day {day}: {day_condition}, {day_temp}°C\\n\"\n", - " \n", - " return forecast\n", - "\n", - "# Document Search Engine Tool\n", - "@tool\n", - "def document_search_engine(query: str, document_type: Optional[str] = \"all\") -> str:\n", - " \"\"\"\n", - " Search through internal documents, policies, and knowledge base.\n", - " \n", - " This tool can search for:\n", - " - Company policies and procedures\n", - " - Technical documentation and manuals\n", - " - Compliance and regulatory documents\n", - " - Historical records and reports\n", - " - Product specifications and requirements\n", - " - Legal documents and contracts\n", - " \n", - " Args:\n", - " query (str): Search terms or questions about documents\n", - " document_type (str, optional): Type of document to search (\"policy\", \"technical\", \"legal\", \"all\")\n", - " \n", - " Returns:\n", - " str: Relevant document excerpts and references\n", - " \n", - " Examples:\n", - " - \"Find our data privacy policy\"\n", - " - \"Search for loan approval procedures\"\n", - " - \"What are the security guidelines for API access?\"\n", - " - \"Show me compliance requirements for financial reporting\"\n", - " \"\"\"\n", - " document_db = {\n", - " \"policy\": [\n", - " \"Data Privacy Policy: All 
personal data must be encrypted...\",\n", - " \"Remote Work Policy: Employees may work remotely up to 3 days...\",\n", - " \"Security Policy: All systems require multi-factor authentication...\"\n", - " ],\n", - " \"technical\": [\n", - " \"API Documentation: REST endpoints available at /api/v1/...\",\n", - " \"Database Schema: User table contains id, name, email...\",\n", - " \"Deployment Guide: Use Docker containers with Kubernetes...\"\n", - " ],\n", - " \"legal\": [\n", - " \"Terms of Service: By using this service, you agree to...\",\n", - " \"Privacy Notice: We collect information to provide services...\",\n", - " \"Compliance Framework: SOX requirements mandate quarterly audits...\"\n", - " ]\n", - " }\n", - " \n", - " results = []\n", - " search_types = [document_type] if document_type != \"all\" else document_db.keys()\n", - " \n", - " for doc_type in search_types:\n", - " if doc_type in document_db:\n", - " for doc in document_db[doc_type]:\n", - " if any(term.lower() in doc.lower() for term in query.split()):\n", - " results.append(f\"[{doc_type.upper()}] {doc}\")\n", - " \n", - " if not results:\n", - " results.append(f\"No documents found matching '{query}'\")\n", - " \n", - " return \"\\n\\n\".join(results)\n", - "\n", - "# Smart Validator Tool\n", - "@tool\n", - "def smart_validator(input_data: str, validation_type: str = \"auto\") -> str:\n", - " \"\"\"\n", - " Validate and verify various types of data and inputs.\n", - " \n", - " This tool can validate:\n", - " - Email addresses (format, domain, deliverability)\n", - " - Phone numbers (format, country code, carrier info)\n", - " - URLs and web addresses\n", - " - Credit card numbers (format, type, checksum)\n", - " - Social security numbers and tax IDs\n", - " - Postal codes and addresses\n", - " - Date formats and ranges\n", - " - File formats and data integrity\n", - " \n", - " Args:\n", - " input_data (str): Data to validate\n", - " validation_type (str): Type of validation (\"email\", \"phone\", \"url\", \"auto\")\n", - " \n", - " Returns:\n", - " str: Validation results with detailed feedback\n", - " \n", - " Examples:\n", - " - \"Validate this email: user@example.com\"\n", - " - \"Is this a valid phone number: +1-555-123-4567?\"\n", - " - \"Check if this URL is valid: https://example.com\"\n", - " - \"Verify this credit card format: 4111-1111-1111-1111\"\n", - " \"\"\"\n", - " import re\n", - " \n", - " if validation_type == \"auto\":\n", - " # Auto-detect validation type\n", - " if \"@\" in input_data and \".\" in input_data:\n", - " validation_type = \"email\"\n", - " elif any(char.isdigit() for char in input_data) and any(char in \"+-() \" for char in input_data):\n", - " validation_type = \"phone\"\n", - " elif input_data.startswith((\"http://\", \"https://\", \"www.\")):\n", - " validation_type = \"url\"\n", - " else:\n", - " validation_type = \"general\"\n", - " \n", - " if validation_type == \"email\":\n", - " pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n", - " is_valid = re.match(pattern, input_data) is not None\n", - " return f\"Email '{input_data}' is {'valid' if is_valid else 'invalid'}\"\n", - " \n", - " elif validation_type == \"phone\":\n", - " pattern = r'^\\+?1?[-.\\s]?\\(?[0-9]{3}\\)?[-.\\s]?[0-9]{3}[-.\\s]?[0-9]{4}$'\n", - " is_valid = re.match(pattern, input_data) is not None\n", - " return f\"Phone number '{input_data}' is {'valid' if is_valid else 'invalid'}\"\n", - " \n", - " elif validation_type == \"url\":\n", - " pattern = 
r'^https?://(?:[-\\w.])+(?:\\:[0-9]+)?(?:/(?:[\\w/_.])*(?:\\?(?:[\\w&=%.])*)?(?:\\#(?:[\\w.])*)?)?$'\n", - " is_valid = re.match(pattern, input_data) is not None\n", - " return f\"URL '{input_data}' is {'valid' if is_valid else 'invalid'}\"\n", - " \n", - " else:\n", - " return f\"Performed general validation on '{input_data}' - appears to be safe text input\"\n", - "\n", - "# Task Assistant Tool\n", - "@tool\n", - "def task_assistant(task_description: str, context: Optional[str] = None) -> str:\n", - " \"\"\"\n", - " General-purpose task assistance and problem-solving tool.\n", - " \n", - " This tool can help with:\n", - " - Breaking down complex tasks into steps\n", - " - Providing guidance and recommendations\n", - " - Answering questions and explaining concepts\n", - " - Suggesting solutions to problems\n", - " - Planning and organizing activities\n", - " - Research and information gathering\n", - " \n", - " Args:\n", - " task_description (str): Description of the task or question\n", - " context (str, optional): Additional context or background information\n", - " \n", - " Returns:\n", - " str: Helpful guidance, steps, or information for the task\n", - " \n", - " Examples:\n", - " - \"How do I prepare for a job interview?\"\n", - " - \"What are the steps to deploy a web application?\"\n", - " - \"Help me plan a team meeting agenda\"\n", - " - \"Explain machine learning concepts for beginners\"\n", - " \"\"\"\n", - " responses = {\n", - " \"meeting\": \"For planning meetings: 1) Define objectives, 2) Create agenda, 3) Invite participants, 4) Prepare materials, 5) Set time limits\",\n", - " \"interview\": \"Interview preparation: 1) Research the company, 2) Practice common questions, 3) Prepare examples, 4) Plan your outfit, 5) Arrive early\",\n", - " \"deploy\": \"Deployment steps: 1) Test in staging, 2) Backup production, 3) Deploy code, 4) Run health checks, 5) Monitor performance\",\n", - " \"learning\": \"Learning approach: 1) Start with basics, 2) Practice regularly, 3) Build projects, 4) Join communities, 5) Stay updated\"\n", - " }\n", - " \n", - " task_lower = task_description.lower()\n", - " for key, response in responses.items():\n", - " if key in task_lower:\n", - " return f\"Task assistance for '{task_description}':\\n\\n{response}\"\n", - " \n", - " \n", - " return f\"\"\"For the task '{task_description}', I recommend: 1) Break it into smaller steps, 2) Gather necessary resources, 3)\n", - " Create a timeline, 4) Start with the most critical parts, 5) Review and adjust as needed.\n", - " \"\"\"\n", - "\n", - "# Collect all tools for the LLM router\n", - "AVAILABLE_TOOLS = [\n", - " advanced_calculator,\n", - " weather_service, \n", - " document_search_engine,\n", - " smart_validator,\n", - " task_assistant\n", - "]\n", - "\n", - "print(\"Enhanced tools with rich docstrings created!\")\n", - "print(f\"Available tools: {len(AVAILABLE_TOOLS)}\")\n", - "for tool in AVAILABLE_TOOLS:\n", - " print(f\" - {tool.name}: {tool.description[:50]}...\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Tool Selection Router" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def create_llm_tool_router(available_tools: List, llm_model: str = \"gpt-4o-mini\"):\n", - " \"\"\"\n", - " Create an intelligent router that uses LLM to select appropriate tools.\n", - " \n", - " Args:\n", - " available_tools: List of LangChain tools with docstrings\n", - " llm_model: LLM model to use for routing 
decisions\n", - " \n", - " Returns:\n", - " Function that routes user input to appropriate tools\n", - " \"\"\"\n", - " \n", - " # Initialize LLM for routing decisions\n", - " routing_llm = ChatOpenAI(model=llm_model, temperature=0.1)\n", - " \n", - " def generate_tool_descriptions(tools: List) -> str:\n", - " \"\"\"Generate formatted tool descriptions for the LLM.\"\"\"\n", - " descriptions = []\n", - " for tool in tools:\n", - " tool_info = {\n", - " \"name\": tool.name,\n", - " \"description\": tool.description,\n", - " \"args\": tool.args if hasattr(tool, 'args') else {},\n", - " \"examples\": []\n", - " }\n", - " \n", - " # Extract examples from docstring if available\n", - " if hasattr(tool, 'func') and tool.func.__doc__:\n", - " docstring = tool.func.__doc__\n", - " if \"Examples:\" in docstring:\n", - " examples_section = docstring.split(\"Examples:\")[1]\n", - " examples = [line.strip().replace(\"- \", \"\") for line in examples_section.split(\"\\n\") \n", - " if line.strip() and line.strip().startswith(\"-\")]\n", - " tool_info[\"examples\"] = examples[:3] # Limit to 3 examples\n", - " \n", - " descriptions.append(tool_info)\n", - " \n", - " return json.dumps(descriptions, indent=2)\n", - " \n", - " def intelligent_router(user_input: str, conversation_history: List = None) -> Dict[str, Any]:\n", - " \"\"\"\n", - " Use LLM to intelligently select the most appropriate tool(s).\n", - " \n", - " Args:\n", - " user_input: User's request/question\n", - " conversation_history: Previous conversation context\n", - " \n", - " Returns:\n", - " Dict with routing decision and reasoning\n", - " \"\"\"\n", - " \n", - " # Generate tool descriptions\n", - " tool_descriptions = generate_tool_descriptions(available_tools)\n", - " \n", - " # Build context from conversation history\n", - " context = \"\"\n", - " if conversation_history and len(conversation_history) > 0:\n", - " recent_messages = conversation_history[-4:] # Last 4 messages for context\n", - " context = \"\\n\".join([f\"{msg.type}: {msg.content[:100]}...\" \n", - " for msg in recent_messages if hasattr(msg, 'content')])\n", - " \n", - " # Create the routing prompt\n", - " routing_prompt = f\"\"\"You are an intelligent tool router. Your job is to analyze user requests and select the most appropriate tool(s) to handle them.\n", - "\n", - " AVAILABLE TOOLS:\n", - " {tool_descriptions}\n", - "\n", - " CONVERSATION CONTEXT:\n", - " {context if context else \"No previous context\"}\n", - "\n", - " USER REQUEST: \"{user_input}\"\n", - "\n", - " Analyze the user's request and determine:\n", - " 1. Which tool(s) would best handle this request\n", - " 2. If multiple tools are needed, what's the order?\n", - " 3. What parameters should be passed to each tool?\n", - " 4. If no tools are needed, should this go to general conversation?\n", - "\n", - " Respond in this JSON format:\n", - " {{\n", - " \"routing_decision\": \"tool_required\" | \"general_conversation\" | \"help_request\",\n", - " \"selected_tools\": [\n", - " {{\n", - " \"tool_name\": \"tool_name\",\n", - " \"confidence\": 0.95,\n", - " \"parameters\": {{\"param\": \"value\"}},\n", - " \"reasoning\": \"Why this tool was selected\"\n", - " }}\n", - " ],\n", - " \"execution_order\": [\"tool1\", \"tool2\"],\n", - " \"overall_reasoning\": \"Overall analysis of the request\"\n", - " }}\n", - "\n", - " IMPORTANT: Be precise with tool selection. 
Consider the tool descriptions and examples carefully.\"\"\"\n", - "\n", - "        try:\n", - "            # Get LLM routing decision\n", - "            response = routing_llm.invoke([\n", - "                SystemMessage(content=\"You are a precise tool routing specialist. Always respond with valid JSON.\"),\n", - "                HumanMessage(content=routing_prompt)\n", - "            ])\n", - "            \n", - "            print(f\"Conversation history: {conversation_history}\")\n", - "            print(f\"Routing response: {response}\")\n", - "            # Parse the response\n", - "            routing_result = json.loads(response.content)\n", - "            print(f\"Routing result: {routing_result}\")\n", - "\n", - "            # Validate and enhance the result\n", - "            validated_result = validate_routing_decision(routing_result, available_tools)\n", - "            \n", - "            return validated_result\n", - "            \n", - "        except json.JSONDecodeError as e:\n", - "            # Fallback to simple routing if JSON parsing fails\n", - "            return {\n", - "                \"routing_decision\": \"general_conversation\",\n", - "                \"selected_tools\": [],\n", - "                \"execution_order\": [],\n", - "                \"overall_reasoning\": f\"Failed to parse LLM response: {e}\",\n", - "                \"fallback\": True\n", - "            }\n", - "        except Exception as e:\n", - "            # General error fallback\n", - "            return {\n", - "                \"routing_decision\": \"general_conversation\", \n", - "                \"selected_tools\": [],\n", - "                \"execution_order\": [],\n", - "                \"overall_reasoning\": f\"Router error: {e}\",\n", - "                \"error\": True\n", - "            }\n", - "    \n", - "    def validate_routing_decision(decision: Dict, tools: List) -> Dict:\n", - "        \"\"\"Validate and enhance the routing decision.\"\"\"\n", - "        \n", - "        # Get available tool names\n", - "        tool_names = [tool.name for tool in tools]\n", - "        \n", - "        # Validate selected tools exist\n", - "        valid_tools = []\n", - "        for tool_selection in decision.get(\"selected_tools\", []):\n", - "            tool_name = tool_selection.get(\"tool_name\")\n", - "            if tool_name in tool_names:\n", - "                valid_tools.append(tool_selection)\n", - "            else:\n", - "                # Find closest match\n", - "                from difflib import get_close_matches\n", - "                matches = get_close_matches(tool_name, tool_names, n=1, cutoff=0.6)\n", - "                if matches:\n", - "                    tool_selection[\"tool_name\"] = matches[0]\n", - "                    tool_selection[\"corrected\"] = True\n", - "                    valid_tools.append(tool_selection)\n", - "        \n", - "        # Update the decision\n", - "        decision[\"selected_tools\"] = valid_tools\n", - "        decision[\"execution_order\"] = [tool[\"tool_name\"] for tool in valid_tools]\n", - "        \n", - "        # Add tool count\n", - "        decision[\"tool_count\"] = len(valid_tools)\n", - "        \n", - "        return decision\n", - "    \n", - "    return intelligent_router\n", - "\n", - "# Create the intelligent router\n", - "intelligent_tool_router = create_llm_tool_router(AVAILABLE_TOOLS)\n", - "\n", - "print(\"LLM-Powered Tool Router Created!\")\n", - "print(\"Router Features:\")\n", - "print(\"  - Uses LLM for intelligent tool selection\")\n", - "print(\"  - Analyzes tool docstrings and examples\")\n", - "print(\"  - Considers conversation context\")\n", - "print(\"  - Provides confidence scores and reasoning\")\n", - "print(\"  - Handles multi-tool requests\")\n", - "print(\"  - Validates tool selections\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Complete LangGraph Agent with Intelligent Router\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import Annotated, Sequence, TypedDict\n", - "\n", - "# These imports require the langgraph package (not installed by the pip cell above)\n", - "from langchain_core.messages import AIMessage, BaseMessage\n", - "from langgraph.checkpoint.memory import MemorySaver\n", - "from langgraph.graph import StateGraph, END, START\n", - "from langgraph.graph.message import add_messages\n", - "from langgraph.prebuilt import ToolNode\n", - "\n", - "# Enhanced Agent State\n", - "class IntelligentAgentState(TypedDict):\n", - "    messages: Annotated[Sequence[BaseMessage], add_messages]\n", - "    user_input: str\n", - "    
session_id: str\n", - " context: dict\n", - " routing_result: dict # Store LLM routing decision\n", - " selected_tools: list\n", - " tool_results: dict\n", - "\n", - "def create_intelligent_langgraph_agent():\n", - " \"\"\"Create a LangGraph agent with LLM-powered tool selection.\"\"\"\n", - " \n", - " # Initialize the main LLM for responses\n", - " main_llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0.7)\n", - " \n", - " # Bind tools to the main LLM\n", - " llm_with_tools = main_llm.bind_tools(AVAILABLE_TOOLS)\n", - " \n", - " def intelligent_router_node(state: IntelligentAgentState) -> IntelligentAgentState:\n", - " \"\"\"Router node that uses LLM to select appropriate tools.\"\"\"\n", - " \n", - " user_input = state[\"user_input\"]\n", - " messages = state.get(\"messages\", [])\n", - " \n", - " print(f\"Router analyzing: '{user_input}'\")\n", - " \n", - " # Use the intelligent router to analyze the request\n", - " routing_result = intelligent_tool_router(user_input, messages)\n", - " \n", - " print(f\"Routing decision: {routing_result['routing_decision']}\")\n", - " print(f\"Selected tools: {[tool['tool_name'] for tool in routing_result.get('selected_tools', [])]}\")\n", - " \n", - " # Store routing result in state\n", - " return {\n", - " **state,\n", - " \"routing_result\": routing_result,\n", - " \"selected_tools\": routing_result.get(\"selected_tools\", [])\n", - " }\n", - " \n", - " def llm_node(state: IntelligentAgentState) -> IntelligentAgentState:\n", - " \"\"\"Main LLM node that processes requests and decides on tool usage.\"\"\"\n", - " \n", - " messages = state[\"messages\"]\n", - " routing_result = state.get(\"routing_result\", {})\n", - " \n", - " # Create a system message based on routing analysis\n", - " system_context = f\"\"\"You are a helpful AI assistant with access to specialized tools.\n", - " ROUTING ANALYSIS:\n", - " - Decision: {routing_result.get('routing_decision', 'unknown')}\n", - " - Reasoning: {routing_result.get('overall_reasoning', 'No analysis available')}\n", - " - Selected Tools: {[tool['tool_name'] for tool in routing_result.get('selected_tools', [])]}\n", - " Based on the routing analysis, use the appropriate tools to help the user. If tools were recommended, use them. If not, respond conversationally.\n", - " \"\"\"\n", - " \n", - " # Add system context to messages\n", - " enhanced_messages = [SystemMessage(content=system_context)] + list(messages)\n", - " \n", - " # Get LLM response\n", - " response = llm_with_tools.invoke(enhanced_messages)\n", - " \n", - " return {\n", - " **state,\n", - " \"messages\": messages + [response]\n", - " }\n", - " \n", - " def should_continue(state: IntelligentAgentState) -> str:\n", - " \"\"\"Decide whether to use tools or end the conversation.\"\"\"\n", - " last_message = state[\"messages\"][-1]\n", - " \n", - " # Check if the LLM wants to use tools\n", - " if hasattr(last_message, 'tool_calls') and last_message.tool_calls:\n", - " return \"tools\"\n", - " \n", - " return END\n", - " \n", - " def help_node(state: IntelligentAgentState) -> IntelligentAgentState:\n", - " \"\"\"Provide help information about available capabilities.\"\"\"\n", - " \n", - " help_message = f\"\"\"🤖 **AI Assistant Capabilities**\n", - " \n", - " I'm an intelligent assistant with access to specialized tools. 
Here's what I can help you with:\n", - "\n", - " 🧮 **Advanced Calculator** - Mathematical calculations and expressions\n", - " Examples: \"Calculate the square root of 144\", \"What's 25% of 200?\"\n", - "\n", - " 🌤️ **Weather Service** - Current weather and forecasts worldwide \n", - " Examples: \"Weather in Tokyo\", \"3-day forecast for London\"\n", - "\n", - " 🔍 **Document Search** - Find information in internal documents\n", - " Examples: \"Find privacy policy\", \"Search for API documentation\"\n", - "\n", - " ✅ **Smart Validator** - Validate emails, phone numbers, URLs, etc.\n", - " Examples: \"Validate user@example.com\", \"Check this phone number\"\n", - "\n", - " 🎯 **Task Assistant** - General guidance and problem-solving\n", - " Examples: \"How to prepare for an interview\", \"Help plan a meeting\"\n", - "\n", - " Just describe what you need in natural language, and I'll automatically select the right tools to help you!\"\"\"\n", - " \n", - " messages = state.get(\"messages\", [])\n", - " return {\n", - " **state,\n", - " \"messages\": messages + [AIMessage(content=help_message)]\n", - " }\n", - " \n", - " # Create the state graph\n", - " workflow = StateGraph(IntelligentAgentState)\n", - " \n", - " # Add nodes\n", - " workflow.add_node(\"router\", intelligent_router_node)\n", - " workflow.add_node(\"llm\", llm_node) \n", - " workflow.add_node(\"tools\", ToolNode(AVAILABLE_TOOLS))\n", - " workflow.add_node(\"help\", help_node)\n", - " \n", - " # Set entry point\n", - " workflow.add_edge(START, \"router\")\n", - " \n", - " # Conditional routing from router based on LLM analysis\n", - " def route_after_analysis(state: IntelligentAgentState) -> str:\n", - " \"\"\"Route based on the LLM's analysis.\"\"\"\n", - " routing_result = state.get(\"routing_result\", {})\n", - " decision = routing_result.get(\"routing_decision\", \"general_conversation\")\n", - " \n", - " if decision == \"help_request\":\n", - " return \"help\"\n", - " else:\n", - " return \"llm\" # Let LLM handle both tool usage and general conversation\n", - " \n", - " workflow.add_conditional_edges(\n", - " \"router\",\n", - " route_after_analysis,\n", - " {\"help\": \"help\", \"llm\": \"llm\"}\n", - " )\n", - " \n", - " # From LLM, decide whether to use tools or end\n", - " workflow.add_conditional_edges(\n", - " \"llm\",\n", - " should_continue,\n", - " {\"tools\": \"tools\", END: END}\n", - " )\n", - " \n", - " # Tool execution flows back to LLM for final response\n", - " workflow.add_edge(\"tools\", \"llm\")\n", - " \n", - " # Help goes to end\n", - " workflow.add_edge(\"help\", END)\n", - " \n", - " # Set up memory\n", - " memory = MemorySaver()\n", - " \n", - " # Compile the graph\n", - " agent = workflow.compile(checkpointer=memory)\n", - " \n", - " return agent\n", - "\n", - "# Create the intelligent agent\n", - "intelligent_agent = create_intelligent_langgraph_agent()\n", - "\n", - "print(\"Intelligent LangGraph Agent Created!\")\n", - "print(\"Features:\")\n", - "print(\" - LLM-powered tool selection\")\n", - "print(\" - Analyzes tool docstrings and examples\")\n", - "print(\" - Context-aware routing decisions\")\n", - "print(\" - Automatic tool parameter extraction\")\n", - "print(\" - Confidence scoring and reasoning\")\n", - "print(\" - Fallback handling for edge cases\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ValidMind Model Integration\n", - "\n", - "Now we'll integrate our LangGraph agent with ValidMind for comprehensive testing and validation. 
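\n", - "\n", - "Before wiring the agent into ValidMind, a quick smoke test confirms the compiled graph runs end to end. This is a minimal sketch reusing the state fields defined above (the query and thread id are arbitrary examples):\n", - "\n", - "```python\n", - "smoke_state = {\n", - "    \"user_input\": \"Find our data privacy policy\",\n", - "    \"messages\": [HumanMessage(content=\"Find our data privacy policy\")],\n", - "    \"session_id\": \"smoke-test\",\n", - "    \"context\": {},\n", - "    \"routing_result\": {},\n", - "    \"selected_tools\": [],\n", - "    \"tool_results\": {},\n", - "}\n", - "result = intelligent_agent.invoke(\n", - "    smoke_state, config={\"configurable\": {\"thread_id\": \"smoke-test\"}}\n", - ")\n", - "print(result[\"messages\"][-1].content)\n", - "```\n", - "\n", - "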
This step is crucial for:\n", - "\n", - "**Model Wrapping**: We create a wrapper function (`agent_fn`) that standardizes the agent interface for ValidMind\n", - "- **Input Formatting**: Converts ValidMind inputs to the agent's expected format\n", - "- **State Management**: Handles session configuration and conversation threads\n", - "- **Result Processing**: Returns agent responses in a consistent format\n", - "\n", - "**ValidMind Model Initialization**: Using `vm.init_model()` creates a ValidMind model object that:\n", - "- **Enables Testing**: Allows us to run validation tests on the agent\n", - "- **Tracks Performance**: Monitors agent behavior and responses \n", - "- **Provides Documentation**: Generates documentation and analysis reports\n", - "- **Supports Evaluation**: Enables quantitative assessment of agent capabilities\n", - "\n", - "This integration allows us to treat our LangGraph agent like any other machine learning model in the ValidMind ecosystem, enabling comprehensive testing and validation workflows." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def agent_fn(input):\n", - "    \"\"\"\n", - "    Invoke the intelligent agent with the given input.\n", - "    \"\"\"\n", - "    initial_state = {\n", - "        \"user_input\": input[\"input\"],\n", - "        \"messages\": [HumanMessage(content=input[\"input\"])],\n", - "        \"session_id\": input[\"session_id\"],\n", - "        \"context\": {},\n", - "        \"routing_result\": {},\n", - "        \"selected_tools\": [],\n", - "        \"tool_results\": {}\n", - "    }\n", - "\n", - "    session_config = {\"configurable\": {\"thread_id\": input[\"session_id\"]}}\n", - "\n", - "    result = intelligent_agent.invoke(initial_state, config=session_config)\n", - "\n", - "    return {\"prediction\": result['messages'][-1].content, \"output\": result, \"tools_used\": result['selected_tools']}\n", - "\n", - "\n", - "vm_intelligent_model = vm.init_model(input_id=\"financial_model\", predict_fn=agent_fn)\n", - "# Attach the compiled agent to the ValidMind model object\n", - "vm_intelligent_model.model = intelligent_agent" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_intelligent_model.model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prepare Sample Test Dataset\n", - "\n", - "We'll create a comprehensive test dataset to evaluate our agent's performance across different scenarios. 
This dataset includes:\n", - "\n", - "**Diverse Test Cases**: Various types of user requests that test different agent capabilities:\n", - "- **Single Tool Requests**: Simple queries that require one specific tool\n", - "- **Multi-Tool Requests**: Complex queries requiring multiple tools in sequence \n", - "- **Validation Tasks**: Requests for data validation and verification\n", - "- **General Assistance**: Open-ended questions for problem-solving guidance\n", - "\n", - "**Expected Outputs**: For each test case, we define:\n", - "- **Expected Tools**: Which tools should be selected by the router\n", - "- **Possible Outputs**: Valid response patterns or values\n", - "- **Session IDs**: Unique identifiers for conversation tracking\n", - "\n", - "**Test Coverage**: The dataset covers:\n", - "- Mathematical calculations (calculator tool)\n", - "- Weather information (weather service) \n", - "- Document retrieval (search engine)\n", - "- Data validation (validator tool)\n", - "- General guidance (task assistant)\n", - "\n", - "This structured approach allows us to systematically evaluate both tool selection accuracy and response quality." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import uuid\n", - "\n", - "test_dataset = pd.DataFrame([\n", - "    {\n", - "        \"input\": \"Calculate the square root of 256 plus 15\",\n", - "        \"expected_tools\": [\"advanced_calculator\"],\n", - "        \"possible_outputs\": [31, 16.46],  # sqrt(256) + 15 = 31; sqrt(271) is roughly 16.46\n", - "        \"session_id\": str(uuid.uuid4())\n", - "    },\n", - "    {\n", - "        \"input\": \"What's the weather like in Barcelona today?\", \n", - "        \"expected_tools\": [\"weather_service\"],\n", - "        \"possible_outputs\": [\"sunny\", \"rainy\", \"cloudy\"],\n", - "        \"session_id\": str(uuid.uuid4())\n", - "    },\n", - "    {\n", - "        \"input\": \"Find our company's data privacy policy\",\n", - "        \"expected_tools\": [\"document_search_engine\"],\n", - "        \"possible_outputs\": [\"privacy_policy.pdf\", \"data_protection.doc\", \"company_privacy_guidelines.txt\"],\n", - "        \"session_id\": str(uuid.uuid4())\n", - "    },\n", - "    {\n", - "        \"input\": \"Validate this email address: john.doe@company.com\",\n", - "        \"expected_tools\": [\"smart_validator\"],\n", - "        \"possible_outputs\": [\"valid\", \"invalid\"],\n", - "        \"session_id\": str(uuid.uuid4())\n", - "    },\n", - "    {\n", - "        \"input\": \"How should I prepare for a technical interview?\",\n", - "        \"expected_tools\": [\"task_assistant\"],\n", - "        \"possible_outputs\": [\"algorithms\", \"data structures\", \"system design\", \"coding practice\"],\n", - "        \"session_id\": str(uuid.uuid4())\n", - "    },\n", - "    {\n", - "        \"input\": \"What's 25% of 480 and show me the weather in Tokyo\",\n", - "        \"expected_tools\": [\"advanced_calculator\", \"weather_service\"],\n", - "        \"possible_outputs\": [120, \"sunny\", \"rainy\", \"cloudy\", \"20°C\", \"68°F\"],\n", - "        \"session_id\": str(uuid.uuid4())\n", - "    },\n", - "    {\n", - "        \"input\": \"Help me understand machine learning basics\",\n", - "        \"expected_tools\": [\"task_assistant\"],\n", - "        \"possible_outputs\": [\"supervised\", \"unsupervised\", \"neural networks\", \"training\", \"testing\"],\n", - "        \"session_id\": str(uuid.uuid4())\n", - "    },\n", - "    {\n", - "        \"input\": \"What can you do for me?\",\n", - "        \"expected_tools\": [\"task_assistant\"],\n", - "        \"possible_outputs\": [\"calculator\", \"weather\", \"email validator\", \"document search\", \"general assistance\"],\n", - "        \"session_id\": str(uuid.uuid4())\n", - "    },\n", - "    
{\n", - " \"input\": \"Calculate 5+3 and check the weather in Paris\",\n", - " \"expected_tools\": [\"advanced_calculator\", \"weather_service\"],\n", - " \"possible_outputs\": [8, \"sunny\", \"rainy\", \"cloudy\", \"22°C\", \"72°F\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " }\n", - "])\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Initialize ValidMind Dataset\n", - "\n", - "Before we can run tests and evaluations, we need to initialize our test dataset as a ValidMind dataset object. This process:\n", - "\n", - "**Dataset Registration**: Creates a ValidMind dataset object that can be used in testing workflows\n", - "- **Input Identification**: Assigns a unique `input_id` for tracking and reference\n", - "- **Target Column Definition**: Specifies which column contains expected outputs for evaluation\n", - "- **Metadata Preservation**: Maintains all dataset information and structure\n", - "\n", - "**Testing Preparation**: The initialized dataset enables:\n", - "- **Systematic Evaluation**: Consistent testing across all data points\n", - "- **Performance Tracking**: Monitoring of agent responses and accuracy\n", - "- **Result Documentation**: Automatic generation of test reports and metrics\n", - "- **Comparison Analysis**: Benchmarking against expected outputs\n", - "\n", - "This step is essential for integrating our agent evaluation into ValidMind's comprehensive testing and validation framework.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset = vm.init_dataset(\n", - " input_id=\"test_dataset\",\n", - " dataset=test_dataset,\n", - " target_column=\"possible_outputs\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Run Agent and Assign Predictions\n", - "\n", - "Now we'll execute our agent on the test dataset and capture its responses for evaluation. This step:\n", - "\n", - "**Agent Execution**: Runs the agent on each test case in our dataset\n", - "- **Automatic Processing**: Iterates through all test inputs systematically\n", - "- **Response Capture**: Records complete agent responses including tool calls and outputs\n", - "- **Session Management**: Maintains separate conversation threads for each test case\n", - "- **Error Handling**: Gracefully manages any execution failures or timeouts\n", - "\n", - "**Prediction Assignment**: Links agent responses to the dataset for analysis\n", - "- **Response Mapping**: Associates each input with its corresponding agent output \n", - "- **Metadata Preservation**: Maintains conversation state, tool calls, and routing decisions\n", - "- **Format Standardization**: Ensures responses are in a consistent format for evaluation\n", - "\n", - "This process generates the prediction data needed for comprehensive performance evaluation and comparison against expected outputs." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_predictions(vm_intelligent_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Dataframe display settings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pd.set_option('display.width', 120)\n", - "pd.set_option('display.max_colwidth', None)\n", - "vm_test_dataset._df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Visualization\n", - "This section visualizes the LangGraph agent's workflow structure using Mermaid diagrams.\n", - "The test below validates that the agent's architecture is properly structured by:\n", - "- Checking if the model has a valid LangGraph Graph object\n", - "- Generating a visual representation of component connections and flow\n", - "- Ensuring the graph can be properly rendered as a Mermaid diagram" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import langgraph\n", - "\n", - "@vm.test(\"my_custom_tests.LangGraphVisualization\")\n", - "def LangGraphVisualization(model):\n", - "    \"\"\"\n", - "    Visualizes the LangGraph workflow structure using Mermaid diagrams.\n", - "    \n", - "    ### Purpose\n", - "    Creates a visual representation of the LangGraph agent's workflow using Mermaid diagrams\n", - "    to show the connections and flow between different components. This helps validate that\n", - "    the agent's architecture is properly structured.\n", - "    \n", - "    ### Test Mechanism\n", - "    1. Retrieves the graph representation from the model using get_graph()\n", - "    2. Attempts to render it as a Mermaid diagram\n", - "    3. 
Returns the visualization and validation results\n", - "    \n", - "    ### Signs of High Risk\n", - "    - Failure to generate graph visualization indicates potential structural issues\n", - "    - Missing or broken connections between components\n", - "    - Invalid graph structure that cannot be rendered\n", - "    \"\"\"\n", - "    try:\n", - "        if not hasattr(model, 'model') or not isinstance(model.model, langgraph.graph.state.CompiledStateGraph):\n", - "            return {\n", - "                'test_results': False,\n", - "                'summary': {\n", - "                    'status': 'FAIL', \n", - "                    'details': 'Model must have a LangGraph Graph object as model attribute'\n", - "                }\n", - "            }\n", - "        graph = model.model.get_graph(xray=False)\n", - "        mermaid_png = graph.draw_mermaid_png()\n", - "        return mermaid_png\n", - "    except Exception as e:\n", - "        return {\n", - "            'test_results': False, \n", - "            'summary': {\n", - "                'status': 'FAIL',\n", - "                'details': f'Failed to generate graph visualization: {str(e)}'\n", - "            }\n", - "        }\n", - "\n", - "vm.tests.run_test(\n", - "    \"my_custom_tests.LangGraphVisualization\",\n", - "    inputs = {\n", - "        \"model\": vm_intelligent_model\n", - "    }\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Accuracy Test\n", - "The purpose of this test is to evaluate the agent's ability to provide accurate responses by:\n", - "- Testing against a dataset of predefined questions and expected answers\n", - "- Checking if responses contain expected keywords\n", - "- Providing detailed test results including pass/fail status\n", - "- Helping identify any gaps in the agent's knowledge or response quality" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import validmind as vm\n", - "\n", - "@vm.test(\"my_custom_tests.accuracy_test\")\n", - "def accuracy_test(model, dataset, list_of_columns):\n", - "    \"\"\"\n", - "    Run tests on a dataset of questions and expected responses.\n", - "    A response passes if it contains at least one of its expected keywords.\n", - "    \"\"\"\n", - "    df = dataset._df\n", - "    \n", - "    # Pre-compute responses for all tests\n", - "    y_true = dataset.y.tolist()\n", - "    y_pred = dataset.y_pred(model).tolist()\n", - "\n", - "    # Keyword-containment check per row\n", - "    test_results = []\n", - "    for response, keywords in zip(y_pred, y_true):\n", - "        test_results.append(any(str(keyword).lower() in str(response).lower() for keyword in keywords))\n", - "    \n", - "    results = pd.DataFrame()\n", - "    column_names = [col + \"_details\" for col in list_of_columns]\n", - "    results[column_names] = df[list_of_columns]\n", - "    results[\"actual\"] = y_pred\n", - "    results[\"expected\"] = y_true\n", - "    results[\"passed\"] = test_results\n", - "    results[\"error\"] = [\n", - "        None if passed else f'Response did not contain any expected keywords: {keywords}'\n", - "        for passed, keywords in zip(test_results, y_true)\n", - "    ]\n", - "    \n", - "    return results\n", - "    \n", - "result = vm.tests.run_test(\n", - "    \"my_custom_tests.accuracy_test\",\n", - "    inputs={\n", - "        \"dataset\": vm_test_dataset,\n", - "        \"model\": vm_intelligent_model\n", - "    },\n", - "    params={\n", - "        \"list_of_columns\": [\"input\"]\n", - "    }\n", - ")\n", - "result.log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Tool Call Accuracy Test\n", - "\n", - "This test evaluates how accurately our intelligent router selects the correct tools for different user requests. 
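\n", - "\n", - "Concretely, each row is scored as $\\text{accuracy} = |\\text{expected} \\cap \\text{called}| \\, / \\, |\\text{expected}|$, which is how the implementation in the code cell below computes its per-row score.\n", - "\n", - "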
It's a critical validation step that measures:\n", - "\n", - "**Tool Selection Performance**: Analyzes whether the agent correctly identifies and calls the expected tools\n", - "- **Expected vs. Actual**: Compares tools that should be called with tools that were actually called\n", - "- **Accuracy Scoring**: Calculates percentage accuracy for tool selection decisions\n", - "- **Multi-tool Handling**: Evaluates performance on requests requiring multiple tools\n", - "\n", - "**Router Intelligence Assessment**: Validates the LLM-powered routing system's effectiveness\n", - "- **Intent Recognition**: How well the router understands user intent from natural language\n", - "- **Tool Mapping**: Accuracy of mapping user needs to appropriate tool capabilities\n", - "- **Decision Quality**: Assessment of routing confidence and reasoning\n", - "\n", - "**Failure Analysis**: Identifies patterns in incorrect tool selections to improve the routing logic\n", - "- **Missed Tools**: Cases where expected tools weren't selected\n", - "- **Extra Tools**: Cases where unnecessary tools were selected \n", - "- **Wrong Tools**: Cases where completely incorrect tools were selected\n", - "\n", - "This test provides quantitative feedback on the agent's core intelligence - its ability to understand what users need and select the right tools to help them." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import validmind as vm\n", - "\n", - "# Test with a real LangGraph result instead of creating mock objects\n", - "@vm.test(\"my_custom_tests.tool_call_accuracy\")\n", - "def tool_call_accuracy(dataset, agent_output_column, expected_tools_column):\n", - " \"\"\"Test validation using actual LangGraph agent results.\"\"\"\n", - " # Let's create a simpler validation without the complex RAGAS setup\n", - " def validate_tool_calls_simple(messages, expected_tools):\n", - " \"\"\"Simple validation of tool calls without RAGAS dependency issues.\"\"\"\n", - " \n", - " tool_calls_found = []\n", - " \n", - " for message in messages:\n", - " if hasattr(message, 'tool_calls') and message.tool_calls:\n", - " for tool_call in message.tool_calls:\n", - " # Handle both dictionary and object formats\n", - " if isinstance(tool_call, dict):\n", - " tool_calls_found.append(tool_call['name'])\n", - " else:\n", - " # ToolCall object - use attribute access\n", - " tool_calls_found.append(tool_call.name)\n", - " \n", - " # Check if expected tools were called\n", - " accuracy = 0.0\n", - " matches = 0\n", - " if expected_tools:\n", - " matches = sum(1 for tool in expected_tools if tool in tool_calls_found)\n", - " accuracy = matches / len(expected_tools)\n", - " \n", - " return {\n", - " 'accuracy': accuracy,\n", - " 'expected_tools': expected_tools,\n", - " 'found_tools': tool_calls_found,\n", - " 'matches': matches,\n", - " 'total_expected': len(expected_tools) if expected_tools else 0\n", - " }\n", - "\n", - " df = dataset._df\n", - " \n", - " results = []\n", - " for i, row in df.iterrows():\n", - " result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])\n", - " results.append(result)\n", - " \n", - " return results\n", - "\n", - "vm.tests.run_test(\n", - " \"my_custom_tests.tool_call_accuracy\",\n", - " inputs = {\n", - " \"dataset\": vm_test_dataset,\n", - " },\n", - " params = {\n", - " \"agent_output_column\": \"output\",\n", - " \"expected_tools_column\": \"expected_tools\"\n", - " }\n", - ")" - ] - }, - { - "cell_type": 
"markdown", - "metadata": {}, - "source": [ - "## RAGAS Tests for Agent Evaluation\n", - "\n", - "RAGAS (Retrieval-Augmented Generation Assessment) provides specialized metrics for evaluating conversational AI systems like our LangGraph agent. These tests analyze different aspects of agent performance:\n", - "\n", - "**Why RAGAS for Agents**: Our agent uses tools to retrieve information (weather, documents, calculations) and generates responses based on that context, making it similar to a RAG system. RAGAS metrics help evaluate:\n", - "\n", - "- **Response Quality**: How well the agent uses retrieved tool outputs to generate helpful responses\n", - "- **Information Faithfulness**: Whether agent responses accurately reflect tool outputs \n", - "- **Relevance Assessment**: How well responses address the original user query\n", - "- **Context Utilization**: How effectively the agent incorporates tool results into final answers\n", - "\n", - "**Test Preparation**: We extract tool outputs as \"context\" for RAGAS evaluation:\n", - "- **Tool Message Extraction**: Capture outputs from calculator, weather, search, and validation tools\n", - "- **Context Mapping**: Treat tool results as retrieved context for evaluation\n", - "- **Response Analysis**: Evaluate final agent responses against both user input and tool context\n", - "\n", - "These tests provide insights into how well our agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to users.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Dataset Preparation - Extract Context from Agent State\n", - "\n", - "Before running RAGAS tests, we need to extract and prepare the context information from our agent's execution results. This process:\n", - "\n", - "**Tool Output Extraction**: Retrieves the outputs from tools used during agent execution\n", - "- **Message Parsing**: Analyzes the agent's conversation state to find tool outputs\n", - "- **Content Aggregation**: Combines outputs from multiple tools when used in sequence\n", - "- **Context Formatting**: Structures tool outputs as context for RAGAS evaluation\n", - "\n", - "**RAGAS Format Preparation**: Converts agent data into the format expected by RAGAS metrics\n", - "- **User Input**: Original user queries from the test dataset\n", - "- **Retrieved Context**: Tool outputs treated as \"retrieved\" information \n", - "- **Agent Response**: Final responses generated by the agent\n", - "- **Ground Truth**: Expected outputs for comparison\n", - "\n", - "This preparation step is essential because RAGAS metrics were designed for traditional RAG systems, so we need to map our agent's tool-based architecture to the RAG paradigm for meaningful evaluation. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from notebooks.agents.utils import capture_tool_output_messages#, #extract_tool_results_only, get_final_agent_response, format_tool_outputs_for_display\n", - "\n", - "tool_messages = []\n", - "for i, row in vm_test_dataset._df.iterrows():\n", - " tool_message = \"\"\n", - " result = row['output']\n", - " # Capture all tool outputs and metadata\n", - " captured_data = capture_tool_output_messages(result)\n", - "\n", - " # Access specific tool outputs\n", - " for output in captured_data[\"tool_outputs\"]:\n", - " tool_message += output['content']\n", - " tool_messages.append([tool_message])\n", - "\n", - "vm_test_dataset._df['tool_messages'] = tool_messages" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset._df.head(2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Faithfulness\n", - "\n", - "Faithfulness measures how accurately the agent's responses reflect the information retrieved from tools. This metric evaluates:\n", - "\n", - "**Information Accuracy**: Whether the agent correctly uses tool outputs in its responses\n", - "- **Fact Preservation**: Ensuring numerical results, weather data, and document content are accurately reported\n", - "- **No Hallucination**: Verifying the agent doesn't invent information not provided by tools\n", - "- **Source Attribution**: Checking that responses align with actual tool outputs\n", - "\n", - "**Critical for Agent Trust**: Faithfulness is essential for agent reliability because users need to trust that:\n", - "- Calculator results are reported correctly\n", - "- Weather information is accurate \n", - "- Document searches return real information\n", - "- Validation results are properly communicated" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.Faithfulness\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"response_column\": [\"financial_model_prediction\"],\n", - " \"retrieved_contexts_column\": [\"tool_messages\"],\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Response Relevancy\n", - "\n", - "Response Relevancy evaluates how well the agent's answers address the user's original question or request. This metric assesses:\n", - "\n", - "**Query Alignment**: Whether responses directly answer what users asked for\n", - "- **Intent Fulfillment**: Checking if the agent understood and addressed the user's actual need\n", - "- **Completeness**: Ensuring responses provide sufficient information to satisfy the query\n", - "- **Focus**: Avoiding irrelevant information that doesn't help the user\n", - "\n", - "**Conversational Quality**: Measures the agent's ability to maintain relevant, helpful dialogue\n", - "- **Context Awareness**: Responses should be appropriate for the conversation context\n", - "- **User Satisfaction**: Answers should be useful and actionable for the user\n", - "- **Clarity**: Information should be presented in a way that directly helps the user\n", - "\n", - "High relevancy indicates the agent successfully understands user needs and provides targeted, helpful responses." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.ResponseRelevancy\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " params={\n", - " \"user_input_column\": \"input\",\n", - " \"response_column\": \"financial_model_prediction\",\n", - " \"retrieved_contexts_column\": \"tool_messages\",\n", - " }\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Context Recall\n", - "\n", - "Context Recall measures how well the agent utilizes the information retrieved from tools when generating its responses. This metric evaluates:\n", - "\n", - "**Information Utilization**: Whether the agent effectively incorporates tool outputs into its responses\n", - "- **Coverage**: How much of the available tool information is used in the response\n", - "- **Integration**: How well tool outputs are woven into coherent, natural responses\n", - "- **Completeness**: Whether all relevant information from tools is considered\n", - "\n", - "**Tool Effectiveness**: Assesses whether selected tools provide useful context for responses\n", - "- **Relevance**: Whether tool outputs actually help answer the user's question\n", - "- **Sufficiency**: Whether enough information was retrieved to generate good responses\n", - "- **Quality**: Whether the tools provided accurate, helpful information\n", - "\n", - "High context recall indicates the agent not only selects the right tools but also effectively uses their outputs to create comprehensive, well-informed responses." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.ContextRecall\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"retrieved_contexts_column\": [\"tool_messages\"],\n", - " \"reference_column\": [\"financial_model_prediction\"],\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### AspectCritic\n", - "\n", - "AspectCritic provides comprehensive evaluation across multiple dimensions of agent performance. This metric analyzes various aspects of response quality:\n", - "\n", - "**Multi-Dimensional Assessment**: Evaluates responses across different quality criteria\n", - "- **Helpfulness**: Whether responses genuinely assist users in accomplishing their goals\n", - "- **Relevance**: How well responses address the specific user query\n", - "- **Coherence**: Whether responses are logically structured and easy to follow\n", - "- **Correctness**: Accuracy of information and appropriateness of recommendations\n", - "\n", - "**Holistic Quality Scoring**: Provides an overall assessment that considers:\n", - "- **User Experience**: How satisfying and useful the interaction would be for real users\n", - "- **Professional Standards**: Whether responses meet quality expectations for production systems\n", - "- **Consistency**: Whether the agent maintains quality across different types of requests\n", - "\n", - "AspectCritic helps identify specific areas where the agent excels or needs improvement, providing actionable insights for enhancing overall performance and user satisfaction." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.AspectCritic\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"response_column\": [\"financial_model_prediction\"],\n", - " \"retrieved_contexts_column\": [\"tool_messages\"],\n", - " },\n", - ").log()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "ValidMind Library", - "language": "python", - "name": "validmind" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/notebooks/agents/langgraph_agent_simple_demo.ipynb b/notebooks/agents/langgraph_agent_simple_demo.ipynb deleted file mode 100644 index 24260c68b..000000000 --- a/notebooks/agents/langgraph_agent_simple_demo.ipynb +++ /dev/null @@ -1,1005 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "# Simplified LangGraph Agent Model Documentation\n", - "\n", - "This notebook demonstrates how to build and validate a simplified AI agent using LangGraph integrated with ValidMind for comprehensive testing and monitoring.\n", - "\n", - "Learn how to create intelligent agents that can:\n", - "- **Automatically select appropriate tools** based on user queries using LLM-powered routing\n", - "- **Manage workflows** with state management and memory\n", - "- **Handle two specialized tools** with smart decision-making\n", - "- **Provide validation and testing** through ValidMind integration\n", - "\n", - "We'll build a simplified agent system that intelligently routes user requests to two specialized tools: **search_engine** for document search and **task_assistant** for general assistance, then validate its performance using ValidMind's testing framework.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "vscode": { - "languageId": "raw" - } - }, - "source": [ - "## Setup and Imports\n", - "\n", - "First, let's import all the necessary libraries for building our LangGraph agent system:\n", - "\n", - "- **LangChain components** for LLM integration and tool management\n", - "- **LangGraph** for building stateful, multi-step agent workflows \n", - "- **ValidMind** for model validation and testing\n", - "- **Standard libraries** for data handling and environment management\n", - "\n", - "The setup includes loading environment variables (like OpenAI API keys) needed for the LLM components to function properly.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q langgraph langchain validmind openai" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import TypedDict, Annotated, Sequence, Optional\n", - "from langchain.tools import tool\n", - "from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage\n", - "from langchain_openai import ChatOpenAI\n", - "from langgraph.graph import StateGraph, END, START\n", - "from langgraph.prebuilt import ToolNode\n", - "from langgraph.checkpoint.memory import MemorySaver\n", - "from langgraph.graph.message import 
add_messages\n", - "import pandas as pd\n", - "\n", - "# Load environment variables if using .env file\n", - "try:\n", - " from dotenv import load_dotenv\n", - " load_dotenv()\n", - "except ImportError:\n", - " print(\"dotenv not installed. Make sure OPENAI_API_KEY is set in your environment.\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import validmind as vm\n", - "\n", - "vm.init(\n", - " api_host=\"...\",\n", - " api_key=\"...\",\n", - " api_secret=\"...\",\n", - " model=\"...\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Simplified Tools with Rich Docstrings\n", - "\n", - "We've simplified the agent to use only two core tools:\n", - "- **search_engine**: For searching through documents, policies, and knowledge base \n", - "- **task_assistant**: For general-purpose task assistance and problem-solving\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Search Engine Tool\n", - "@tool\n", - "def search_engine(query: str, document_type: Optional[str] = \"all\") -> str:\n", - " \"\"\"\n", - " Search through internal documents, policies, and knowledge base.\n", - " \n", - " This tool can search for:\n", - " - Company policies and procedures\n", - " - Technical documentation and manuals\n", - " - Compliance and regulatory documents\n", - " - Historical records and reports\n", - " - Product specifications and requirements\n", - " - Legal documents and contracts\n", - " \n", - " Args:\n", - " query (str): Search terms or questions about documents\n", - " document_type (str, optional): Type of document to search (\"policy\", \"technical\", \"legal\", \"all\")\n", - " \n", - " Returns:\n", - " str: Relevant document excerpts and references\n", - " \n", - " Examples:\n", - " - \"Find our data privacy policy\"\n", - " - \"Search for loan approval procedures\"\n", - " - \"What are the security guidelines for API access?\"\n", - " - \"Show me compliance requirements for financial reporting\"\n", - " \"\"\"\n", - " document_db = {\n", - " \"policy\": [\n", - " \"Data Privacy Policy: All personal data must be encrypted...\",\n", - " \"Remote Work Policy: Employees may work remotely up to 3 days...\",\n", - " \"Security Policy: All systems require multi-factor authentication...\"\n", - " ],\n", - " \"technical\": [\n", - " \"API Documentation: REST endpoints available at /api/v1/...\",\n", - " \"Database Schema: User table contains id, name, email...\",\n", - " \"Deployment Guide: Use Docker containers with Kubernetes...\"\n", - " ],\n", - " \"legal\": [\n", - " \"Terms of Service: By using this service, you agree to...\",\n", - " \"Privacy Notice: We collect information to provide services...\",\n", - " \"Compliance Framework: SOX requirements mandate quarterly audits...\"\n", - " ]\n", - " }\n", - " \n", - " results = []\n", - " search_types = [document_type] if document_type != \"all\" else document_db.keys()\n", - " \n", - " for doc_type in search_types:\n", - " if doc_type in document_db:\n", - " for doc in document_db[doc_type]:\n", - " if any(term.lower() in doc.lower() for term in query.split()):\n", - " results.append(f\"[{doc_type.upper()}] {doc}\")\n", - " \n", - " if not results:\n", - " results.append(f\"No documents found matching '{query}'\")\n", - " \n", - " return \"\\n\\n\".join(results)\n", - "\n", - "# Task Assistant Tool\n", - "@tool\n", - "def task_assistant(task_description: str, context: 
Optional[str] = None) -> str:\n", - " \"\"\"\n", - " General-purpose task assistance and problem-solving tool.\n", - " \n", - " This tool can help with:\n", - " - Breaking down complex tasks into steps\n", - " - Providing guidance and recommendations\n", - " - Answering questions and explaining concepts\n", - " - Suggesting solutions to problems\n", - " - Planning and organizing activities\n", - " - Research and information gathering\n", - " \n", - " Args:\n", - " task_description (str): Description of the task or question\n", - " context (str, optional): Additional context or background information\n", - " \n", - " Returns:\n", - " str: Helpful guidance, steps, or information for the task\n", - " \n", - " Examples:\n", - " - \"How do I prepare for a job interview?\"\n", - " - \"What are the steps to deploy a web application?\"\n", - " - \"Help me plan a team meeting agenda\"\n", - " - \"Explain machine learning concepts for beginners\"\n", - " \"\"\"\n", - " responses = {\n", - " \"meeting\": \"For planning meetings: 1) Define objectives, 2) Create agenda, 3) Invite participants, 4) Prepare materials, 5) Set time limits\",\n", - " \"interview\": \"Interview preparation: 1) Research the company, 2) Practice common questions, 3) Prepare examples, 4) Plan your outfit, 5) Arrive early\",\n", - " \"deploy\": \"Deployment steps: 1) Test in staging, 2) Backup production, 3) Deploy code, 4) Run health checks, 5) Monitor performance\",\n", - " \"learning\": \"Learning approach: 1) Start with basics, 2) Practice regularly, 3) Build projects, 4) Join communities, 5) Stay updated\"\n", - " }\n", - " \n", - " task_lower = task_description.lower()\n", - " for key, response in responses.items():\n", - " if key in task_lower:\n", - " return f\"Task assistance for '{task_description}':\\n\\n{response}\"\n", - " \n", - " \n", - " return f\"\"\"For the task '{task_description}', I recommend: 1) Break it into smaller steps, 2) Gather necessary resources, 3)\n", - " Create a timeline, 4) Start with the most critical parts, 5) Review and adjust as needed.\n", - " \"\"\"\n", - "\n", - "# Collect all tools for the LLM router - SIMPLIFIED TO ONLY 2 TOOLS\n", - "AVAILABLE_TOOLS = [\n", - " search_engine,\n", - " task_assistant\n", - "]\n", - "\n", - "print(\"Simplified tools created!\")\n", - "print(f\"Available tools: {len(AVAILABLE_TOOLS)}\")\n", - "for tool in AVAILABLE_TOOLS:\n", - " print(f\" - {tool.name}: {tool.description[:50]}...\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Complete LangGraph Agent with Intelligent Router\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "# Simplified Agent State (removed routing fields)\n", - "class IntelligentAgentState(TypedDict):\n", - " messages: Annotated[Sequence[BaseMessage], add_messages]\n", - " user_input: str\n", - " session_id: str\n", - " context: dict\n", - "\n", - "def create_intelligent_langgraph_agent():\n", - " \"\"\"Create a simplified LangGraph agent with direct LLM tool selection.\"\"\"\n", - " \n", - " # Initialize the main LLM for responses\n", - " main_llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0.7)\n", - " \n", - " # Bind tools to the main LLM\n", - " llm_with_tools = main_llm.bind_tools(AVAILABLE_TOOLS)\n", - " \n", - " def llm_node(state: IntelligentAgentState) -> IntelligentAgentState:\n", - " \"\"\"Main LLM node that processes requests and directly selects tools.\"\"\"\n", - " \n", - " messages = state[\"messages\"]\n", - 
" \n", - " # Enhanced system prompt with tool selection guidance\n", - " system_context = f\"\"\"You are a helpful AI assistant with access to specialized tools.\n", - " Analyze the user's request and directly use the most appropriate tools to help them.\n", - " \n", - " AVAILABLE TOOLS:\n", - " 🔍 **search_engine** - Search through internal documents, policies, and knowledge base\n", - " - Use for: finding company policies, technical documentation, compliance documents\n", - " - Examples: \"Find our data privacy policy\", \"Search for API documentation\"\n", - "\n", - " 🎯 **task_assistant** - General-purpose task assistance and problem-solving \n", - " - Use for: guidance, recommendations, explaining concepts, planning activities\n", - " - Examples: \"How to prepare for an interview\", \"Help plan a meeting\", \"Explain machine learning\"\n", - "\n", - " INSTRUCTIONS:\n", - " - Analyze the user's request carefully\n", - " - If they need to find documents/policies → use search_engine\n", - " - If they need general help/guidance/explanations → use task_assistant \n", - " - If the request needs specific information search, use search_engine first\n", - " - You can use tools directly based on the user's needs\n", - " - Provide helpful, accurate responses based on tool outputs\n", - " - If no tools are needed, respond conversationally\n", - "\n", - " Choose and use tools wisely to provide the most helpful response.\"\"\"\n", - " \n", - " # Add system context to messages\n", - " enhanced_messages = [SystemMessage(content=system_context)] + list(messages)\n", - " \n", - " # Get LLM response with tool selection\n", - " response = llm_with_tools.invoke(enhanced_messages)\n", - " \n", - " return {\n", - " **state,\n", - " \"messages\": messages + [response]\n", - " }\n", - " \n", - " def should_continue(state: IntelligentAgentState) -> str:\n", - " \"\"\"Decide whether to use tools or end the conversation.\"\"\"\n", - " last_message = state[\"messages\"][-1]\n", - " \n", - " # Check if the LLM wants to use tools\n", - " if hasattr(last_message, 'tool_calls') and last_message.tool_calls:\n", - " return \"tools\"\n", - " \n", - " return END\n", - " \n", - " \n", - " # Create the simplified state graph \n", - " workflow = StateGraph(IntelligentAgentState)\n", - " \n", - " # Add nodes (removed router node)\n", - " workflow.add_node(\"llm\", llm_node) \n", - " workflow.add_node(\"tools\", ToolNode(AVAILABLE_TOOLS))\n", - " \n", - " # Simplified entry point - go directly to LLM\n", - " workflow.add_edge(START, \"llm\")\n", - " \n", - " # From LLM, decide whether to use tools or end\n", - " workflow.add_conditional_edges(\n", - " \"llm\",\n", - " should_continue,\n", - " {\"tools\": \"tools\", END: END}\n", - " )\n", - " \n", - " # Tool execution flows back to LLM for final response\n", - " workflow.add_edge(\"tools\", \"llm\")\n", - " \n", - " # Set up memory\n", - " memory = MemorySaver()\n", - " \n", - " # Compile the graph\n", - " agent = workflow.compile(checkpointer=memory)\n", - " \n", - " return agent\n", - "\n", - "# Create the simplified intelligent agent\n", - "intelligent_agent = create_intelligent_langgraph_agent()\n", - "\n", - "print(\"Simplified LangGraph Agent Created!\")\n", - "print(\"Features:\")\n", - "print(\" - Direct LLM tool selection (no separate router)\")\n", - "print(\" - Enhanced system prompt for intelligent tool choice\")\n", - "print(\" - Streamlined workflow: LLM -> Tools -> Response\")\n", - "print(\" - Automatic tool parameter extraction\")\n", - "print(\" - Clean, 
simplified architecture\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## ValidMind Model Integration\n", - "\n", - "Now we'll integrate our LangGraph agent with ValidMind for comprehensive testing and validation. This step is crucial for:\n", - "\n", - "**Model Wrapping**: We create a wrapper function (`agent_fn`) that standardizes the agent interface for ValidMind\n", - "- **Input Formatting**: Converts ValidMind inputs to the agent's expected format\n", - "- **State Management**: Handles session configuration and conversation threads\n", - "- **Result Processing**: Returns agent responses in a consistent format\n", - "\n", - "**ValidMind Agent Initialization**: Using `vm.init_model()` creates a ValidMind model object that:\n", - "- **Enables Testing**: Allows us to run validation tests on the agent\n", - "- **Tracks Performance**: Monitors agent behavior and responses \n", - "- **Provides Documentation**: Generates documentation and analysis reports\n", - "- **Supports Evaluation**: Enables quantitative assessment of agent capabilities\n", - "\n", - "This integration allows us to treat our LangGraph agent like any other machine learning model in the ValidMind ecosystem, enabling comprehensive testing and validation workflows." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def agent_fn(input):\n", - " \"\"\"\n", - " Invoke the simplified agent with the given input.\n", - " \"\"\"\n", - " # Simplified initial state (removed routing fields)\n", - " initial_state = {\n", - " \"user_input\": input[\"input\"],\n", - " \"messages\": [HumanMessage(content=input[\"input\"])],\n", - " \"session_id\": input[\"session_id\"],\n", - " \"context\": {}\n", - " }\n", - "\n", - " session_config = {\"configurable\": {\"thread_id\": input[\"session_id\"]}}\n", - "\n", - " result = intelligent_agent.invoke(initial_state, config=session_config)\n", - "\n", - " return {\"prediction\": result['messages'][-1].content, \"output\": result}\n", - "\n", - "\n", - "vm_intelligent_model = vm.init_model(input_id=\"financial_model\", predict_fn=agent_fn)\n", - "# add model to the vm agent\n", - "vm_intelligent_model.model = intelligent_agent" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prepare Sample Test Dataset\n", - "\n", - "We'll create a comprehensive test dataset to evaluate our agent's performance across different scenarios. This dataset includes:\n", - "\n", - "**Diverse Test Cases**: Various types of user requests that test different agent capabilities:\n", - "- **Single Tool Requests**: Simple queries that require one specific tool\n", - "- **Multi-Tool Requests**: Complex queries requiring multiple tools in sequence \n", - "- **Validation Tasks**: Requests for data validation and verification\n", - "- **General Assistance**: Open-ended questions for problem-solving guidance\n", - "\n", - "**Expected Outputs**: For each test case, we define:\n", - "- **Expected Tools**: Which tools should be selected by the router\n", - "- **Possible Outputs**: Valid response patterns or values\n", - "- **Session IDs**: Unique identifiers for conversation tracking\n", - "\n", - "This structured approach allows us to systematically evaluate both tool selection accuracy and response quality." 
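, - "\n", - "When assessing tool selection later, the tools the agent actually invoked can be read back out of the returned state. A small illustrative helper (hypothetical, mirroring the tool-call inspection used in the tool call accuracy test earlier in this document):\n", - "\n", - "```python\n", - "def called_tools(result):\n", - "    # Names of the tools the agent actually invoked in this run\n", - "    return [\n", - "        tc[\"name\"] if isinstance(tc, dict) else tc.name\n", - "        for m in result[\"messages\"]\n", - "        if getattr(m, \"tool_calls\", None)\n", - "        for tc in m.tool_calls\n", - "    ]\n", - "```"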
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import uuid\n", - "\n", - "# Simplified test dataset with only search_engine and task_assistant tools\n", - "test_dataset = pd.DataFrame([\n", - " {\n", - " \"input\": \"Find our company's data privacy policy\",\n", - " \"expected_tools\": [\"search_engine\"],\n", - " \"possible_outputs\": [\"privacy_policy.pdf\", \"data_protection.doc\", \"company_privacy_guidelines.txt\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"Search for loan approval procedures\", \n", - " \"expected_tools\": [\"search_engine\"],\n", - " \"possible_outputs\": [\"loan_procedures.doc\", \"approval_process.pdf\", \"lending_guidelines.txt\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"How should I prepare for a technical interview?\",\n", - " \"expected_tools\": [\"task_assistant\"],\n", - " \"possible_outputs\": [\"algorithms\", \"data structures\", \"system design\", \"coding practice\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"Help me understand machine learning basics\",\n", - " \"expected_tools\": [\"task_assistant\"],\n", - " \"possible_outputs\": [\"supervised\", \"unsupervised\", \"neural networks\", \"training\", \"testing\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"What can you do for me?\",\n", - " \"expected_tools\": [\"task_assistant\"],\n", - " \"possible_outputs\": [\"search documents\", \"provide assistance\", \"answer questions\", \"help with tasks\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"Find technical documentation about API endpoints\",\n", - " \"expected_tools\": [\"search_engine\"],\n", - " \"possible_outputs\": [\"API_documentation.pdf\", \"REST_endpoints.doc\", \"technical_guide.txt\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " },\n", - " {\n", - " \"input\": \"Help me plan a team meeting agenda\",\n", - " \"expected_tools\": [\"task_assistant\"],\n", - " \"possible_outputs\": [\"objectives\", \"agenda\", \"participants\", \"materials\", \"time limits\"],\n", - " \"session_id\": str(uuid.uuid4())\n", - " }\n", - "])\n", - "\n", - "print(\"Simplified test dataset created!\")\n", - "print(f\"Number of test cases: {len(test_dataset)}\")\n", - "print(f\"Test tools: {test_dataset['expected_tools'].explode().unique()}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Display the simplified test dataset\n", - "print(\"Using simplified test dataset with only 2 tools:\")\n", - "print(f\"Number of test cases: {len(test_dataset)}\")\n", - "print(f\"Available tools being tested: {sorted(test_dataset['expected_tools'].explode().unique())}\")\n", - "print(\"\\nTest cases preview:\")\n", - "for i, row in test_dataset.iterrows():\n", - " print(f\"{i+1}. {row['input']} -> Expected tool: {row['expected_tools'][0]}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Initialize ValidMind Dataset\n", - "\n", - "Before we can run tests and evaluations, we need to initialize our test dataset as a ValidMind dataset object. 
\n", - "This step is essential for integrating our agent evaluation into ValidMind's comprehensive testing and validation framework.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset = vm.init_dataset(\n", - " input_id=\"test_dataset\",\n", - " dataset=test_dataset,\n", - " target_column=\"possible_outputs\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Run Agent and Assign Predictions\n", - "\n", - "Now we'll execute our agent on the test dataset and capture its responses for evaluation. This process generates the prediction data needed for comprehensive performance evaluation and comparison against expected outputs." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_predictions(vm_intelligent_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Dataframe display settings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pd.set_option('display.max_colwidth', 40)\n", - "pd.set_option('display.width', 120)\n", - "pd.set_option('display.max_colwidth', None)\n", - "vm_test_dataset._df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Visualization\n", - "This section visualizes the LangGraph agent's workflow structure using Mermaid diagrams.\n", - "The test below validates that the agent's architecture is properly structured by:\n", - "- Checking if the model has a valid LangGraph Graph object\n", - "- Generating a visual representation of component connections and flow\n", - "- Ensuring the graph can be properly rendered as a Mermaid diagram\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import langgraph\n", - "\n", - "@vm.test(\"my_custom_tests.LangGraphVisualization\")\n", - "def LangGraphVisualization(model):\n", - " \"\"\"\n", - " Visualizes the LangGraph workflow structure using Mermaid diagrams.\n", - " \n", - " ### Purpose\n", - " Creates a visual representation of the LangGraph agent's workflow using Mermaid diagrams\n", - " to show the connections and flow between different components. This helps validate that\n", - " the agent's architecture is properly structured.\n", - " \n", - " ### Test Mechanism\n", - " 1. Retrieves the graph representation from the model using get_graph()\n", - " 2. Attempts to render it as a Mermaid diagram\n", - " 3. 
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import langgraph\n",
- "\n",
- "@vm.test(\"my_custom_tests.LangGraphVisualization\")\n",
- "def LangGraphVisualization(model):\n",
- "    \"\"\"\n",
- "    Visualizes the LangGraph workflow structure using Mermaid diagrams.\n",
- "\n",
- "    ### Purpose\n",
- "    Creates a visual representation of the LangGraph agent's workflow using Mermaid diagrams\n",
- "    to show the connections and flow between different components. This helps validate that\n",
- "    the agent's architecture is properly structured.\n",
- "\n",
- "    ### Test Mechanism\n",
- "    1. Retrieves the graph representation from the model using get_graph()\n",
- "    2. Attempts to render it as a Mermaid diagram\n",
- "    3. Returns the visualization and validation results\n",
- "\n",
- "    ### Signs of High Risk\n",
- "    - Failure to generate the graph visualization indicates potential structural issues\n",
- "    - Missing or broken connections between components\n",
- "    - Invalid graph structure that cannot be rendered\n",
- "    \"\"\"\n",
- "    try:\n",
- "        if not hasattr(model, 'model') or not isinstance(model.model, langgraph.graph.state.CompiledStateGraph):\n",
- "            return {\n",
- "                'test_results': False,\n",
- "                'summary': {\n",
- "                    'status': 'FAIL',\n",
- "                    'details': 'Model must have a compiled LangGraph graph as its model attribute'\n",
- "                }\n",
- "            }\n",
- "        graph = model.model.get_graph(xray=False)\n",
- "        mermaid_png = graph.draw_mermaid_png()\n",
- "        return mermaid_png\n",
- "    except Exception as e:\n",
- "        return {\n",
- "            'test_results': False,\n",
- "            'summary': {\n",
- "                'status': 'FAIL',\n",
- "                'details': f'Failed to generate graph visualization: {str(e)}'\n",
- "            }\n",
- "        }\n",
- "\n",
- "vm.tests.run_test(\n",
- "    \"my_custom_tests.LangGraphVisualization\",\n",
- "    inputs={\n",
- "        \"model\": vm_intelligent_model\n",
- "    }\n",
- ").log()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Accuracy Test\n",
- "The purpose of this test is to evaluate the agent's ability to provide accurate responses by:\n",
- "- Testing against a dataset of predefined questions and expected answers\n",
- "- Checking if responses contain expected keywords\n",
- "- Providing detailed test results, including pass/fail status\n",
- "- Helping identify gaps in the agent's knowledge or response quality"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "import validmind as vm\n",
- "\n",
- "@vm.test(\"my_custom_tests.accuracy_test\")\n",
- "def accuracy_test(model, dataset, list_of_columns):\n",
- "    \"\"\"\n",
- "    Run keyword-containment checks on a dataset of questions and expected responses.\n",
- "    A test case passes if the response contains at least one expected keyword.\n",
- "    \"\"\"\n",
- "    df = dataset._df\n",
- "\n",
- "    # Pre-compute expected keywords and model responses for all test cases\n",
- "    y_true = dataset.y.tolist()\n",
- "    y_pred = dataset.y_pred(model).tolist()\n",
- "\n",
- "    # Case-insensitive keyword-containment check per test case\n",
- "    test_results = [\n",
- "        any(str(keyword).lower() in str(response).lower() for keyword in keywords)\n",
- "        for response, keywords in zip(y_pred, y_true)\n",
- "    ]\n",
- "\n",
- "    results = pd.DataFrame()\n",
- "    column_names = [col + \"_details\" for col in list_of_columns]\n",
- "    results[column_names] = df[list_of_columns]\n",
- "    results[\"actual\"] = y_pred\n",
- "    results[\"expected\"] = y_true\n",
- "    results[\"passed\"] = test_results\n",
- "    # Record a per-row error message rather than a single all-or-nothing value\n",
- "    results[\"error\"] = [\n",
- "        None if passed else f'Response did not contain any expected keywords: {expected}'\n",
- "        for passed, expected in zip(test_results, y_true)\n",
- "    ]\n",
- "\n",
- "    return results\n",
- "\n",
- "result = vm.tests.run_test(\n",
- "    \"my_custom_tests.accuracy_test\",\n",
- "    inputs={\n",
- "        \"dataset\": vm_test_dataset,\n",
- "        \"model\": vm_intelligent_model\n",
- "    },\n",
- "    params={\n",
- "        \"list_of_columns\": [\"input\"]\n",
- "    }\n",
- ")\n",
- "result.log()"
- ]
- },
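With the per-case table logged, a one-line aggregate makes regressions easy to spot across runs. A small sketch that recomputes the same keyword check directly from the dataset (using the predictions assigned earlier):

```python
# Recompute the keyword-match pass rate outside the test harness.
y_true = vm_test_dataset.y.tolist()
y_pred = vm_test_dataset.y_pred(vm_intelligent_model).tolist()

passes = [
    any(str(k).lower() in str(resp).lower() for k in keywords)
    for resp, keywords in zip(y_pred, y_true)
]
print(f"Keyword-match pass rate: {sum(passes)}/{len(passes)}")
```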
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Tool Call Accuracy Test\n",
- "\n",
- "This test evaluates how accurately the agent selects the correct tools for different user requests. It provides quantitative feedback on the agent's core capability: understanding what users need and selecting the right tools to help them."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import validmind as vm\n",
- "\n",
- "# Validate tool selection against real LangGraph results rather than mock objects\n",
- "@vm.test(\"my_custom_tests.ToolCallAccuracy\")\n",
- "def ToolCallAccuracy(dataset, agent_output_column, expected_tools_column):\n",
- "    \"\"\"Check that the tools the agent actually called match the expected tools.\"\"\"\n",
- "    def validate_tool_calls_simple(messages, expected_tools):\n",
- "        \"\"\"Simple validation of tool calls that avoids RAGAS dependency issues.\"\"\"\n",
- "\n",
- "        tool_calls_found = []\n",
- "\n",
- "        for message in messages:\n",
- "            if hasattr(message, 'tool_calls') and message.tool_calls:\n",
- "                for tool_call in message.tool_calls:\n",
- "                    # Handle both dictionary and object formats\n",
- "                    if isinstance(tool_call, dict):\n",
- "                        tool_calls_found.append(tool_call['name'])\n",
- "                    else:\n",
- "                        # ToolCall object - use attribute access\n",
- "                        tool_calls_found.append(tool_call.name)\n",
- "\n",
- "        # Check how many of the expected tools were actually called\n",
- "        accuracy = 0.0\n",
- "        matches = 0\n",
- "        if expected_tools:\n",
- "            matches = sum(1 for tool in expected_tools if tool in tool_calls_found)\n",
- "            accuracy = matches / len(expected_tools)\n",
- "\n",
- "        return {\n",
- "            'accuracy': accuracy,\n",
- "            'expected_tools': expected_tools,\n",
- "            'found_tools': tool_calls_found,\n",
- "            'matches': matches,\n",
- "            'total_expected': len(expected_tools) if expected_tools else 0\n",
- "        }\n",
- "\n",
- "    df = dataset._df\n",
- "\n",
- "    results = []\n",
- "    for i, row in df.iterrows():\n",
- "        result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])\n",
- "        results.append(result)\n",
- "\n",
- "    return results\n",
- "\n",
- "vm.tests.run_test(\n",
- "    \"my_custom_tests.ToolCallAccuracy\",\n",
- "    inputs={\n",
- "        \"dataset\": vm_test_dataset,\n",
- "    },\n",
- "    params={\n",
- "        \"agent_output_column\": \"output\",\n",
- "        \"expected_tools_column\": \"expected_tools\"\n",
- "    }\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## RAGAS Tests for Agent Evaluation\n",
- "\n",
- "RAGAS (Retrieval-Augmented Generation Assessment) provides specialized metrics for evaluating conversational AI systems like our LangGraph agent. These tests analyze different aspects of agent performance.\n",
- "\n",
- "Our agent uses tools to retrieve information (document excerpts and task guidance) and generates responses based on that context, making it similar to a RAG system. 
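Concretely, each logged agent run can be read as the (question, contexts, answer) triple that RAGAS scores. A schematic sketch (the `tool_messages` column is constructed in the preparation step below, and `financial_model_prediction` is the prediction column ValidMind created when we assigned predictions; `to_ragas_triple` is purely illustrative):

```python
# Schematic mapping from one agent run to the RAG triple that RAGAS evaluates.
def to_ragas_triple(row):
    return {
        "user_input": row["input"],                     # original user query
        "retrieved_contexts": row["tool_messages"],     # tool outputs as "retrieved" context
        "response": row["financial_model_prediction"],  # agent's final answer
    }
```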
RAGAS metrics help evaluate:\n", - "\n", - "- **Response Quality**: How well the agent uses retrieved tool outputs to generate helpful responses\n", - "- **Information Faithfulness**: Whether agent responses accurately reflect tool outputs \n", - "- **Relevance Assessment**: How well responses address the original user query\n", - "- **Context Utilization**: How effectively the agent incorporates tool results into final answers\n", - "\n", - "These tests provide insights into how well our agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to users.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Dataset Preparation - Extract Context from Agent State\n", - "\n", - "Before running RAGAS tests, we need to extract and prepare the context information from our agent's execution results. This process:\n", - "\n", - "**Tool Output Extraction**: Retrieves the outputs from tools used during agent execution\n", - "- **Message Parsing**: Analyzes the agent's conversation state to find tool outputs\n", - "- **Content Aggregation**: Combines outputs from multiple tools when used in sequence\n", - "- **Context Formatting**: Structures tool outputs as context for RAGAS evaluation\n", - "\n", - "**RAGAS Format Preparation**: Converts agent data into the format expected by RAGAS metrics\n", - "- **User Input**: Original user queries from the test dataset\n", - "- **Retrieved Context**: Tool outputs treated as \"retrieved\" information \n", - "- **Agent Response**: Final responses generated by the agent\n", - "- **Ground Truth**: Expected outputs for comparison\n", - "\n", - "This preparation step is essential because RAGAS metrics were designed for traditional RAG systems, so we need to map our agent's tool-based architecture to the RAG paradigm for meaningful evaluation. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from utils import capture_tool_output_messages\n", - "\n", - "tool_messages = []\n", - "for i, row in vm_test_dataset._df.iterrows():\n", - " tool_message = \"\"\n", - " result = row['output']\n", - " # Capture all tool outputs and metadata\n", - " captured_data = capture_tool_output_messages(result)\n", - " \n", - " # Access specific tool outputs\n", - " for output in captured_data[\"tool_outputs\"]:\n", - " tool_message += output['content']\n", - " tool_messages.append([tool_message])\n", - "\n", - "vm_test_dataset._df['tool_messages'] = tool_messages" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset._df.head(2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Faithfulness\n", - "\n", - "Faithfulness measures how accurately the agent's responses reflect the information retrieved from tools. 
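Because the metric grades the response strictly against these retrieved contexts, it is worth eyeballing one (query, context, response) triple first to confirm the wiring. A quick sanity check:

```python
# Peek at one (query, context, response) triple that Faithfulness will grade.
row0 = vm_test_dataset._df.iloc[0]
print("Query:    ", row0["input"])
print("Context:  ", row0["tool_messages"][0][:200])
print("Response: ", row0["financial_model_prediction"][:200])
```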
Specifically, the metric evaluates:\n",
- "\n",
- "**Information Accuracy**: Whether the agent correctly uses tool outputs in its responses\n",
- "- **Fact Preservation**: Ensuring document excerpts and factual details are accurately reported\n",
- "- **No Hallucination**: Verifying the agent doesn't invent information not provided by tools\n",
- "- **Source Attribution**: Checking that responses align with actual tool outputs\n",
- "\n",
- "**Critical for Agent Trust**: Faithfulness is essential for agent reliability because users need to trust that:\n",
- "- Document searches return real excerpts from the knowledge base\n",
- "- Policy and procedure details are quoted accurately\n",
- "- Task guidance reflects what the tools actually returned"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "vm.tests.run_test(\n",
- "    \"validmind.model_validation.ragas.Faithfulness\",\n",
- "    inputs={\"dataset\": vm_test_dataset},\n",
- "    param_grid={\n",
- "        \"user_input_column\": [\"input\"],\n",
- "        \"response_column\": [\"financial_model_prediction\"],\n",
- "        \"retrieved_contexts_column\": [\"tool_messages\"],\n",
- "    },\n",
- ").log()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Response Relevancy\n",
- "\n",
- "Response Relevancy evaluates how well the agent's answers address the user's original question or request. This metric assesses:\n",
- "\n",
- "**Query Alignment**: Whether responses directly answer what users asked for\n",
- "- **Intent Fulfillment**: Checking if the agent understood and addressed the user's actual need\n",
- "- **Completeness**: Ensuring responses provide sufficient information to satisfy the query\n",
- "- **Focus**: Avoiding irrelevant information that doesn't help the user\n",
- "\n",
- "**Conversational Quality**: Measures the agent's ability to maintain relevant, helpful dialogue\n",
- "- **Context Awareness**: Responses should be appropriate for the conversation context\n",
- "- **User Satisfaction**: Answers should be useful and actionable for the user\n",
- "- **Clarity**: Information should be presented in a way that directly helps the user\n",
- "\n",
- "High relevancy indicates the agent successfully understands user needs and provides targeted, helpful responses."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "vm.tests.run_test(\n",
- "    \"validmind.model_validation.ragas.ResponseRelevancy\",\n",
- "    inputs={\"dataset\": vm_test_dataset},\n",
- "    params={\n",
- "        \"user_input_column\": \"input\",\n",
- "        \"response_column\": \"financial_model_prediction\",\n",
- "        \"retrieved_contexts_column\": \"tool_messages\",\n",
- "    }\n",
- ").log()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Context Recall\n",
- "\n",
- "Context Recall measures how well the agent utilizes the information retrieved from tools when generating its responses. 
This metric evaluates:\n", - "\n", - "**Information Utilization**: Whether the agent effectively incorporates tool outputs into its responses\n", - "- **Coverage**: How much of the available tool information is used in the response\n", - "- **Integration**: How well tool outputs are woven into coherent, natural responses\n", - "- **Completeness**: Whether all relevant information from tools is considered\n", - "\n", - "**Tool Effectiveness**: Assesses whether selected tools provide useful context for responses\n", - "- **Relevance**: Whether tool outputs actually help answer the user's question\n", - "- **Sufficiency**: Whether enough information was retrieved to generate good responses\n", - "- **Quality**: Whether the tools provided accurate, helpful information\n", - "\n", - "High context recall indicates the agent not only selects the right tools but also effectively uses their outputs to create comprehensive, well-informed responses." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.ContextRecall\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"retrieved_contexts_column\": [\"tool_messages\"],\n", - " \"reference_column\": [\"financial_model_prediction\"],\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### AspectCritic\n", - "\n", - "AspectCritic provides comprehensive evaluation across multiple dimensions of agent performance. This metric analyzes various aspects of response quality:\n", - "\n", - "**Multi-Dimensional Assessment**: Evaluates responses across different quality criteria\n", - "- **Helpfulness**: Whether responses genuinely assist users in accomplishing their goals\n", - "- **Relevance**: How well responses address the specific user query\n", - "- **Coherence**: Whether responses are logically structured and easy to follow\n", - "- **Correctness**: Accuracy of information and appropriateness of recommendations\n", - "\n", - "**Holistic Quality Scoring**: Provides an overall assessment that considers:\n", - "- **User Experience**: How satisfying and useful the interaction would be for real users\n", - "- **Professional Standards**: Whether responses meet quality expectations for production systems\n", - "- **Consistency**: Whether the agent maintains quality across different types of requests\n", - "\n", - "AspectCritic helps identify specific areas where the agent excels or needs improvement, providing actionable insights for enhancing overall performance and user satisfaction." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.AspectCritic\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"response_column\": [\"financial_model_prediction\"],\n", - " \"retrieved_contexts_column\": [\"tool_messages\"],\n", - " },\n", - ").log()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "ValidMind Library", - "language": "python", - "name": "validmind" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/notebooks/code_samples/agents/banking_test_dataset.py b/notebooks/code_samples/agents/banking_test_dataset.py new file mode 100644 index 000000000..ade54e754 --- /dev/null +++ b/notebooks/code_samples/agents/banking_test_dataset.py @@ -0,0 +1,161 @@ +import pandas as pd +import uuid + +# Banking-specific test dataset for retail and commercial banking +# tools: credit_risk_analyzer, customer_account_manager, fraud_detection_system +banking_test_dataset = pd.DataFrame([ + { + "input": "Analyze credit risk for a $50,000 personal loan application with $75,000 annual income, $1,200 monthly debt, and 720 credit score", + "expected_tools": ["credit_risk_analyzer"], + "possible_outputs": ["LOW RISK", "MEDIUM RISK", "APPROVE", "debt-to-income ratio", "19.2%", "risk score", "720", "probability of default", "2.5%"], + "session_id": str(uuid.uuid4()), + "category": "credit_risk" + }, + { + "input": "Evaluate credit risk for a business loan of $250,000 with monthly revenue of $85,000 and existing debt of $45,000", + "expected_tools": ["credit_risk_analyzer"], + "possible_outputs": ["MEDIUM RISK", "HIGH RISK", "business loan", "debt service coverage ratio", "1.8", "annual revenue", "$1,020,000", "risk score", "650"], + "session_id": str(uuid.uuid4()), + "category": "credit_risk" + }, + { + "input": "Check account balance for checking account 12345", + "expected_tools": ["customer_account_manager"], + "possible_outputs": ["balance", "$3,247.82", "account information", "John Smith", "checking account", "available balance", "$3,047.82"], + "session_id": str(uuid.uuid4()), + "category": "account_management" + }, + { + "input": "Analyze fraud risk for a $15,000 wire transfer from customer 67890 to Nigeria", + "expected_tools": ["fraud_detection_system"], + "possible_outputs": ["HIGH RISK", "fraud score", "87", "geographic risk", "95%", "amount", "$15,000", "block transaction", "confidence", "92%"], + "session_id": str(uuid.uuid4()), + "category": "fraud_detection" + }, + { + "input": "Recommend banking products for customer 11111 with $150,000 in savings and 720 credit score", + "expected_tools": ["customer_account_manager"], + "possible_outputs": ["product recommendations", "premium accounts", "investment services", "line of credit", "$50,000", "savings rate", "4.25%"], + "session_id": str(uuid.uuid4()), + "category": "account_management" + }, + { + "input": "Investigate suspicious transactions totaling $75,000 across multiple accounts in the last week", + "expected_tools": ["fraud_detection_system"], + "possible_outputs": ["suspicious activity", "pattern analysis", "transaction monitoring", "VERY HIGH RISK", "alert", 
"fraud score", "94", "total amount", "$75,000"], + "session_id": str(uuid.uuid4()), + "category": "fraud_detection" + }, + { + "input": "Assess credit risk for a $1,000,000 commercial real estate loan with $500,000 annual business income", + "expected_tools": ["credit_risk_analyzer"], + "possible_outputs": ["HIGH RISK", "VERY HIGH RISK", "business loan", "commercial", "risk assessment", "loan-to-value", "66.7%", "debt service coverage", "2.0"], + "session_id": str(uuid.uuid4()), + "category": "credit_risk" + }, + { + "input": "Process a $2,500 deposit to savings account 67890", + "expected_tools": ["customer_account_manager"], + "possible_outputs": ["transaction processed", "deposit", "$2,500", "new balance", "$15,847.32", "transaction ID", "TXN-789456123"], + "session_id": str(uuid.uuid4()), + "category": "account_management" + }, + { + "input": "Review credit card application for customer with 580 credit score, $42,000 annual income, and recent bankruptcy", + "expected_tools": ["credit_risk_analyzer"], + "possible_outputs": ["VERY HIGH RISK", "DECLINE", "bankruptcy", "credit score", "580", "probability of default", "35%", "debt-to-income", "78%"], + "session_id": str(uuid.uuid4()), + "category": "credit_risk" + }, + { + "input": "Update customer contact information and address for account holder 22334", + "expected_tools": ["customer_account_manager"], + "possible_outputs": ["customer updated", "address change", "contact information", "profile updated", "customer ID", "22334"], + "session_id": str(uuid.uuid4()), + "category": "account_management" + }, + { + "input": "Detect potential fraud in multiple small transactions under $500 happening rapidly from different locations", + "expected_tools": ["fraud_detection_system"], + "possible_outputs": ["velocity fraud", "geographic anomaly", "HIGH RISK", "transaction pattern", "card fraud", "velocity score", "89", "locations", "4"], + "session_id": str(uuid.uuid4()), + "category": "fraud_detection" + }, + { + "input": "Close dormant account 98765 and transfer remaining balance to active checking account", + "expected_tools": ["customer_account_manager"], + "possible_outputs": ["account closed", "balance transfer", "$487.63", "dormant account", "transaction completed", "account ID", "98765"], + "session_id": str(uuid.uuid4()), + "category": "account_management" + }, + { + "input": "Assess credit risk for auto loan of $35,000 for customer with 650 credit score, $55,000 income, and no previous auto loans", + "expected_tools": ["credit_risk_analyzer"], + "possible_outputs": ["MEDIUM RISK", "auto loan", "first-time borrower", "acceptable risk", "interest rate", "6.75%", "monthly payment", "$574"], + "session_id": str(uuid.uuid4()), + "category": "credit_risk" + }, + { + "input": "Flag unusual ATM withdrawals of $500 every hour for the past 6 hours from account 44556", + "expected_tools": ["fraud_detection_system"], + "possible_outputs": ["velocity pattern", "ATM fraud", "HIGH RISK", "card compromise", "unusual pattern", "total withdrawn", "$3,000", "frequency", "6", "transactions"], + "session_id": str(uuid.uuid4()), + "category": "fraud_detection" + }, + { + "input": "Open new business checking account for LLC with $25,000 initial deposit and setup online banking", + "expected_tools": ["customer_account_manager"], + "possible_outputs": ["business account", "new account", "online banking setup", "LLC registration", "account opened", "initial deposit", "$25,000", "account number", "987654321"], + "session_id": str(uuid.uuid4()), + "category": 
"account_management" + }, + { + "input": "Evaluate creditworthiness for student loan refinancing of $85,000 with recent graduation and $65,000 starting salary", + "expected_tools": ["credit_risk_analyzer"], + "possible_outputs": ["student loan", "refinancing", "MEDIUM RISK", "recent graduate", "debt consolidation", "new rate", "4.5%", "monthly payment", "$878"], + "session_id": str(uuid.uuid4()), + "category": "credit_risk" + }, + { + "input": "Investigate merchant transactions showing unusual chargeback patterns and potential money laundering", + "expected_tools": ["fraud_detection_system"], + "possible_outputs": ["merchant fraud", "chargeback analysis", "money laundering", "VERY HIGH RISK", "compliance alert", "chargeback rate", "15.3%", "risk score", "96"], + "session_id": str(uuid.uuid4()), + "category": "fraud_detection" + }, + { + "input": "Set up automatic bill pay for customer 77889 for utilities, mortgage, and insurance payments", + "expected_tools": ["customer_account_manager"], + "possible_outputs": ["automatic payments", "bill pay setup", "recurring transactions", "payment scheduling", "total monthly", "$2,847", "customer ID", "77889"], + "session_id": str(uuid.uuid4()), + "category": "account_management" + }, + { + "input": "Analyze credit risk for line of credit increase from $10,000 to $25,000 for existing customer with payment history", + "expected_tools": ["credit_risk_analyzer"], + "possible_outputs": ["credit limit increase", "LOW RISK", "payment history", "existing customer", "new limit", "$25,000", "utilization", "12%"], + "session_id": str(uuid.uuid4()), + "category": "credit_risk" + }, + { + "input": "Review suspicious cryptocurrency exchange transactions totaling $200,000 over 3 days from business account", + "expected_tools": ["fraud_detection_system"], + "possible_outputs": ["cryptocurrency", "large transactions", "business account", "HIGH RISK", "regulatory concern", "total amount", "$200,000", "risk score", "91"], + "session_id": str(uuid.uuid4()), + "category": "fraud_detection" + }, + { + "input": "Process stop payment request for check #1234 and issue new checks for customer account 55667", + "expected_tools": ["customer_account_manager"], + "possible_outputs": ["stop payment", "check services", "new checks", "payment blocked", "customer service", "check amount", "$1,247.50", "account", "55667"], + "session_id": str(uuid.uuid4()), + "category": "account_management" + }, + { + "input": "Evaluate mortgage pre-approval for $450,000 home purchase with 20% down payment, 780 credit score, and $125,000 household income", + "expected_tools": ["credit_risk_analyzer"], + "possible_outputs": ["mortgage pre-approval", "LOW RISK", "excellent credit", "strong income", "home purchase", "approved amount", "$450,000", "interest rate", "3.75%", "monthly payment", "$2,083"], + "session_id": str(uuid.uuid4()), + "category": "credit_risk" + } +]) diff --git a/notebooks/code_samples/agents/banking_tools.py b/notebooks/code_samples/agents/banking_tools.py new file mode 100644 index 000000000..b26eab060 --- /dev/null +++ b/notebooks/code_samples/agents/banking_tools.py @@ -0,0 +1,497 @@ +from typing import Optional +from datetime import datetime +from langchain.tools import tool + + +def _score_dti_ratio(dti_ratio: float) -> int: + """Score based on debt-to-income ratio.""" + if dti_ratio <= 28: + return 25 + elif dti_ratio <= 36: + return 20 + elif dti_ratio <= 43: + return 15 + else: + return 5 + + +def _score_credit_score(credit_score: int) -> int: + """Score based on credit 
score.""" + if credit_score >= 750: + return 25 + elif credit_score >= 700: + return 20 + elif credit_score >= 650: + return 15 + elif credit_score >= 600: + return 10 + else: + return 5 + + +def _score_loan_amount(loan_amount: float, monthly_income: float) -> int: + """Score based on loan amount relative to income.""" + if loan_amount <= monthly_income * 12: + return 25 + elif loan_amount <= monthly_income * 18: + return 20 + elif loan_amount <= monthly_income * 24: + return 15 + else: + return 10 + + +def _classify_risk(risk_score: int) -> tuple[str, str]: + """Classify risk level and recommendation based on score.""" + if risk_score >= 70: + return "LOW RISK", "APPROVE with standard terms" + elif risk_score >= 50: + return "MEDIUM RISK", "APPROVE with enhanced monitoring" + elif risk_score >= 30: + return "HIGH RISK", "REQUIRES additional documentation" + else: + return "VERY HIGH RISK", "RECOMMEND DENIAL" + + +def _get_dti_description(dti_ratio: float) -> str: + """Get description for DTI ratio.""" + if dti_ratio <= 28: + return "excellent" + elif dti_ratio <= 36: + return "good" + elif dti_ratio <= 43: + return "acceptable" + else: + return "concerning" + + +def _get_credit_description(credit_score: int) -> str: + """Get description for credit score.""" + if credit_score >= 750: + return "excellent" + elif credit_score >= 700: + return "good" + elif credit_score >= 650: + return "fair" + else: + return "poor" + + +# Credit Risk Analyzer Tool +@tool +def credit_risk_analyzer( + customer_income: float, + customer_debt: float, + credit_score: int, + loan_amount: float, + loan_type: str = "personal" +) -> str: + """ + Analyze credit risk for loan applications and credit decisions. + + This tool evaluates: + - Debt-to-income ratio analysis + - Credit score assessment + - Loan-to-value calculations + - Risk scoring and recommendations + - Regulatory compliance checks + + Args: + customer_income (float): Annual income in USD + customer_debt (float): Total monthly debt payments in USD + credit_score (int): FICO credit score (300-850) + loan_amount (float): Requested loan amount in USD + loan_type (str): Type of loan (personal, mortgage, business, auto) + + Returns: + str: Comprehensive credit risk analysis and recommendations + + Examples: + - "Analyze credit risk for $50k personal loan" + - "Assess mortgage eligibility for $300k home purchase" + - "Calculate risk score for business loan application" + """ + # Calculate debt-to-income ratio + monthly_income = customer_income / 12 + dti_ratio = (customer_debt / monthly_income) * 100 + + # Calculate risk score using helper functions + risk_score = (_score_dti_ratio(dti_ratio) + + _score_credit_score(credit_score) + + _score_loan_amount(loan_amount, monthly_income)) + + # Get risk classification + risk_level, recommendation = _classify_risk(risk_score) + + return f"""CREDIT RISK ANALYSIS REPORT + ================================ + + Customer Profile: + - Annual Income: ${customer_income:,.2f} + - Monthly Debt: ${customer_debt:,.2f} + - Credit Score: {credit_score} + - Loan Request: ${loan_amount:,.2f} ({loan_type}) + + Risk Assessment: + - Debt-to-Income Ratio: {dti_ratio:.1f}% + - Risk Score: {risk_score}/75 + - Risk Level: {risk_level} + + Recommendation: {recommendation} + + Additional Notes: + - DTI ratio of {dti_ratio:.1f}% is {_get_dti_description(dti_ratio)} + - Credit score of {credit_score} is {_get_credit_description(credit_score)} + - Loan amount represents {((loan_amount / customer_income) * 100):.1f}% of annual income + """ + + +def 
_get_customer_database():
+    """Get the mock customer database."""
+    return {
+        "12345": {
+            "name": "John Smith",
+            "checking_balance": 2547.89,
+            "savings_balance": 12500.00,
+            "credit_score": 745,
+            "account_age_days": 450
+        },
+        "67890": {
+            "name": "Sarah Johnson",
+            "checking_balance": 892.34,
+            "savings_balance": 3500.00,
+            "credit_score": 680,
+            "account_age_days": 180
+        },
+        "11111": {
+            "name": "Business Corp LLC",
+            "checking_balance": 45000.00,
+            "savings_balance": 150000.00,
+            "credit_score": 720,
+            "account_age_days": 730
+        }
+    }
+
+
+def _handle_check_balance(customer, account_type, customer_id):
+    """Handle balance check action."""
+    if account_type == "checking":
+        balance = customer["checking_balance"]
+    elif account_type == "savings":
+        balance = customer["savings_balance"]
+    else:
+        return f"Account type '{account_type}' not supported for balance check."
+
+    return f"""ACCOUNT BALANCE REPORT
+    ================================
+
+    Customer: {customer['name']}
+    Account Type: {account_type.title()}
+    Account ID: {customer_id}
+
+    Current Balance: ${balance:,.2f}
+    Last Updated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
+
+    Account Status: Active
+    """
+
+
+def _handle_process_transaction(customer, account_type, amount, customer_id):
+    """Handle transaction processing for checking and savings accounts."""
+    if amount is None:
+        return "Amount is required for transaction processing."
+
+    # Support both deposit-taking account types instead of silently ignoring savings
+    balance_keys = {"checking": "checking_balance", "savings": "savings_balance"}
+    if account_type not in balance_keys:
+        return f"Account type '{account_type}' not supported for transactions."
+
+    balance_key = balance_keys[account_type]
+    current_balance = customer[balance_key]
+    if amount > 0:  # Deposit
+        new_balance = current_balance + amount
+        transaction_type = "DEPOSIT"
+    else:  # Withdrawal
+        if abs(amount) > current_balance:
+            return f"Insufficient funds. Available balance: ${current_balance:,.2f}"
+        new_balance = current_balance + amount  # amount is negative
+        transaction_type = "WITHDRAWAL"
+
+    # Update mock database
+    customer[balance_key] = new_balance
+
+    return f"""TRANSACTION PROCESSED
+    ================================
+
+    Customer: {customer['name']}
+    Account: {account_type.title()} - {customer_id}
+    Transaction: {transaction_type}
+    Amount: ${abs(amount):,.2f}
+
+    Previous Balance: ${current_balance:,.2f}
+    New Balance: ${new_balance:,.2f}
+    Transaction ID: TX{datetime.now().strftime('%Y%m%d%H%M%S')}
+
+    Status: Completed
+    """
+
+
+def _get_product_recommendations(credit_score):
+    """Get product recommendations based on credit score."""
+    if credit_score >= 700:
+        return [
+            "Premium Checking Account with no monthly fees",
+            "High-Yield Savings Account (2.5% APY)",
+            "Personal Line of Credit up to $25,000",
+            "Investment Advisory Services"
+        ]
+    elif credit_score >= 650:
+        return [
+            "Standard Checking Account",
+            "Basic Savings Account (1.2% APY)",
+            "Secured Credit Card",
+            "Debt Consolidation Loan"
+        ]
+    else:
+        return [
+            "Second Chance Checking Account",
+            "Basic Savings Account (0.5% APY)",
+            "Secured Credit Card",
+            "Credit Building Services"
+        ]
+
+
+def _handle_recommend_product(customer):
+    """Handle product recommendation action."""
+    recommendations = _get_product_recommendations(customer["credit_score"])
+
+    return f"""PRODUCT RECOMMENDATIONS
+    ================================
+
+    Customer: {customer['name']}
+    Credit Score: {customer['credit_score']}
+    Account Age: {customer['account_age_days']} days
+
+    Recommended Products:
+    {chr(10).join(f" • {rec}" for rec in recommendations)}
+
+    Next Steps:
+    - Schedule consultation with relationship manager
+    - Review product terms and conditions
+    - Complete application process
+    """
+
+
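These handlers can be exercised directly against the mock database, without going through the agent or the LangChain tool wrapper. A minimal sketch (the customer IDs come from the mock data above):

```python
# Exercise the account handlers directly against the mock database.
db = _get_customer_database()
print(_handle_check_balance(db["12345"], "checking", "12345"))
print(_handle_process_transaction(db["67890"], "savings", 2500.0, "67890"))
print(_handle_recommend_product(db["11111"]))
```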
+def _handle_get_info(customer, customer_id):
+    """Handle get info action."""
+    credit_tier = ('Excellent' if customer['credit_score'] >= 750 else
+                   'Good' if customer['credit_score'] >= 700 else
+                   'Fair' if customer['credit_score'] >= 650 else 'Poor')
+
+    return f"""CUSTOMER ACCOUNT INFORMATION
+    ================================
+
+    Customer ID: {customer_id}
+    Name: {customer['name']}
+    Account Age: {customer['account_age_days']} days
+
+    Account Balances:
+    - Checking: ${customer['checking_balance']:,.2f}
+    - Savings: ${customer['savings_balance']:,.2f}
+
+    Credit Profile:
+    - Credit Score: {customer['credit_score']}
+    - Credit Tier: {credit_tier}
+
+    Services Available:
+    - Online Banking
+    - Mobile App
+    - Bill Pay
+    - Direct Deposit
+    """
+
+
+# Customer Account Manager Tool
+@tool
+def customer_account_manager(
+    account_type: str,
+    customer_id: str,
+    action: str,
+    amount: Optional[float] = None,
+    account_details: Optional[str] = None
+) -> str:
+    """
+    Manage customer accounts and provide banking services.
+
+    This tool handles:
+    - Account information and balances
+    - Transaction processing
+    - Product recommendations
+    - Customer service inquiries
+    - Account maintenance
+
+    Args:
+        account_type (str): Type of account (checking, savings, loan, credit_card)
+        customer_id (str): Customer identifier
+        action (str): Action to perform (check_balance, process_transaction, recommend_product, get_info)
+        amount (float, optional): Transaction amount for financial actions
+        account_details (str, optional): Additional account information
+
+    Returns:
+        str: Account information or transaction results
+
+    Examples:
+        - "Check balance for checking account 12345"
+        - "Process $500 deposit to savings account 67890"
+        - "Recommend products for customer with high balance"
+        - "Get account information for loan account 11111"
+    """
+    customer_db = _get_customer_database()
+
+    if customer_id not in customer_db:
+        return f"Customer ID {customer_id} not found in system."
+
+    customer = customer_db[customer_id]
+
+    if action == "check_balance":
+        return _handle_check_balance(customer, account_type, customer_id)
+    elif action == "process_transaction":
+        return _handle_process_transaction(customer, account_type, amount, customer_id)
+    elif action == "recommend_product":
+        return _handle_recommend_product(customer)
+    elif action == "get_info":
+        return _handle_get_info(customer, customer_id)
+    else:
+        return f"Action '{action}' not supported. Available actions: check_balance, process_transaction, recommend_product, get_info"
+
+
+# Fraud Detection System Tool
+@tool
+def fraud_detection_system(
+    transaction_id: str,
+    customer_id: str,
+    transaction_amount: float,
+    transaction_type: str,
+    location: str,
+    device_id: Optional[str] = None
+) -> str:
+    """
+    Analyze transactions for potential fraud and security risks. 
+ + This tool evaluates: + - Transaction patterns and anomalies + - Geographic risk assessment + - Device fingerprinting + - Behavioral analysis + - Risk scoring and alerts + + Args: + transaction_id (str): Unique transaction identifier + customer_id (str): Customer account identifier + transaction_amount (float): Transaction amount in USD + transaction_type (str): Type of transaction (purchase, withdrawal, transfer, deposit) + location (str): Transaction location or IP address + device_id (str, optional): Device identifier for mobile/online transactions + + Returns: + str: Fraud risk assessment and recommendations + + Examples: + - "Analyze fraud risk for $500 ATM withdrawal in Miami" + - "Check security for $2000 online purchase from new device" + - "Assess risk for $10000 wire transfer to international account" + """ + + # Mock fraud detection logic + risk_score = 0 + risk_factors = [] + recommendations = [] + + # Amount-based risk + if transaction_amount > 10000: + risk_score += 30 + risk_factors.append("High-value transaction (>$10k)") + recommendations.append("Require additional verification") + + if transaction_amount > 1000: + risk_score += 15 + risk_factors.append("Medium-value transaction (>$1k)") + + # Location-based risk + high_risk_locations = ["Nigeria", "Russia", "North Korea", "Iran", "Cuba"] + if any(country in location for country in high_risk_locations): + risk_score += 40 + risk_factors.append("High-risk geographic location") + recommendations.append("Block transaction - high-risk country") + + # Transaction type risk + if transaction_type == "withdrawal" and transaction_amount > 5000: + risk_score += 25 + risk_factors.append("Large cash withdrawal") + recommendations.append("Require in-person verification") + + if transaction_type == "transfer" and transaction_amount > 5000: + risk_score += 20 + risk_factors.append("Large transfer") + recommendations.append("Implement 24-hour delay for verification") + + # Device risk + if device_id and device_id.startswith("UNKNOWN"): + risk_score += 25 + risk_factors.append("Unknown or new device") + recommendations.append("Require multi-factor authentication") + + # Time-based risk (mock: assume night transactions are riskier) + current_hour = datetime.now().hour + if 22 <= current_hour or current_hour <= 6: + risk_score += 10 + risk_factors.append("Unusual transaction time") + + # Risk classification + if risk_score >= 70: + risk_level = "HIGH RISK" + action = "BLOCK TRANSACTION" + elif risk_score >= 40: + risk_level = "MEDIUM RISK" + action = "REQUIRE VERIFICATION" + else: + risk_level = "LOW RISK" + action = "ALLOW TRANSACTION" + + return f"""FRAUD DETECTION ANALYSIS + ================================ + + Transaction Details: + - Transaction ID: {transaction_id} + - Customer ID: {customer_id} + - Amount: ${transaction_amount:,.2f} + - Type: {transaction_type.title()} + - Location: {location} + - Device: {device_id or 'N/A'} + + Risk Assessment: {risk_level} + - Risk Score: {risk_score}/100 + - Risk Factors: {len(risk_factors)} + + Identified Risk Factors: + {chr(10).join(f" • {factor}" for factor in risk_factors)} + + Recommendations: + {chr(10).join(f" • {rec}" for rec in recommendations) if recommendations else " • No additional actions required"} + + Decision: {action} + + Next Steps: + - Log risk assessment in fraud monitoring system + - Update customer risk profile if necessary + - Monitor for similar patterns + """ + + +# Export all banking tools +AVAILABLE_TOOLS = [ + credit_risk_analyzer, + customer_account_manager, + 
fraud_detection_system +] + +if __name__ == "__main__": + print("Banking-specific tools created!") + print(f"Available tools: {len(AVAILABLE_TOOLS)}") + for banking_tool in AVAILABLE_TOOLS: + print(f" - {banking_tool.name}: {banking_tool.description[:80]}...") diff --git a/notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb b/notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb new file mode 100644 index 000000000..e92bc3d65 --- /dev/null +++ b/notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb @@ -0,0 +1,1094 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# AI Agent Validation with ValidMind - Banking Demo\n", + "\n", + "This notebook shows how to document and evaluate an agentic AI system with the ValidMind Library. Using a small banking agent built in LangGraph as an example, you will run ValidMind’s built-in and custom tests and produce the artifacts needed to create evidence-backed documentation.\n", + "\n", + "An AI agent is an autonomous system that interprets inputs, selects from available tools or actions, and carries out multi-step behaviors to achieve user goals. In this example, our agent acts as a professional banking assistant that analyzes user requests and automatically selects and invokes the most appropriate specialized banking tool (credit, account, or fraud) to deliver accurate, compliant, and actionable responses.\n", + "\n", + "However, agentic capabilities bring concrete risks. The agent may misinterpret user inputs or fail to extract required parameters, producing incorrect credit assessments or inappropriate account actions; it can select the wrong tool (for example, invoking account management instead of fraud detection), which may cause unsafe, non-compliant, or customer-impacting behaviour.\n", + "\n", + "This interactive notebook guides you step-by-step through building a demo LangGraph banking agent, preparing an evaluation dataset, initializing the ValidMind Library and required objects, writing custom tests for tool-selection accuracy and entity extraction, running ValidMind’s built-in and custom test suites, and logging documentation artifacts to ValidMind.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Table of Contents\n", + "\n", + "\n", + "- [About ValidMind](#about-validmind)\n", + " - [Before you begin](#before-you-begin)\n", + " - [New to ValidMind?](#new-to-validmind)\n", + " - [Key concepts](#key-concepts)\n", + "- [Install the ValidMind Library](#install-the-validmind-library)\n", + "- [Initialize the ValidMind Library](#initialize-the-validmind-library)\n", + " - [Get your code snippet](#get-your-code-snippet)\n", + " - [Initialize the Python environment](#initialize-the-python-environment)\n", + "- [Banking Tools](#banking-tools)\n", + " - [Tool Overview](#tool-overview)\n", + " - [Test Banking Tools Individually](#test-banking-tools-individually)\n", + "- [Complete LangGraph Banking Agent](#complete-langgraph-banking-agent)\n", + "- [ValidMind Model Integration](#validmind-model-integration)\n", + "- [Prompt Validation](#prompt-validation)\n", + "- [Banking Test Dataset](#banking-test-dataset)\n", + " - [Initialize ValidMind Dataset](#initialize-validmind-dataset)\n", + " - [Run the Agent and capture result through assign predictions](#run-the-agent-and-capture-result-through-assign-predictions)\n", + "- [Banking Accuracy Test](#banking-accuracy-test)\n", + "- [Banking Tool Call Accuracy 
Test](#banking-tool-call-accuracy-test)\n", + "- [RAGAS Tests for an Agent Evaluation](#ragas-tests-for-an-agent-evaluation)\n", + " - [Faithfulness](#faithfulness)\n", + " - [Response Relevancy](#response-relevancy)\n", + " - [Context Recall](#context-recall)\n", + "- [Safety](#safety)\n", + " - [AspectCritic](#aspectcritic)\n", + " - [Prompt bias](#prompt-bias)\n", + " - [Toxicity](#toxicity)\n", + "- [Demo Summary and Next Steps](#demo-summary-and-next-steps)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## About ValidMind\n", + "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.\n", + "\n", + "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.\n", + "\n", + "### Before you begin\n", + "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.\n", + "\n", + "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).\n", + "\n", + "### New to ValidMind?\n", + "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", + "\n", + "