WM2: ASRS Storage Classifier


An AI-powered product classification system that automatically assigns warehouse storage categories based on text descriptions. WM2 combines Claude's reasoning capabilities with tool-augmented retrieval to classify products into optimal ASRS (Automated Storage and Retrieval System) container types.

Live Demo


What It Does

WM2 solves the problem of automatically categorizing warehouse products into the right storage containers. Given a product description—anything from "USB-C cable, 6ft" to "Industrial hydraulic pump, 45 lbs, 18x12x14 inches"—the system determines which ASRS container type can safely and efficiently hold that product.

The classifier operates as an agentic system: rather than making a single inference call, it can autonomously decide to use tools to gather additional information before making its classification decision. This tool-augmented approach allows it to:

  1. Look up known products in a reference database of 479 items using semantic search
  2. Extract explicit dimensions from descriptions using regex pattern matching
  3. Apply constraint-based rules to select the smallest container that fits

Each classification includes a confidence tier and reasoning, making the system's decision-making transparent and auditable.


Classification Categories

Products are classified into one of five container types, each with specific dimension and weight constraints:

| Category  | Max Dimensions (L×W×H) | Max Weight | Typical Products                   |
|-----------|------------------------|------------|------------------------------------|
| Pouch     | 12" × 9" × 2"          | 1 lb       | Small electronics, cables, jewelry |
| Small Bin | 12" × 9" × 6"          | 10 lbs     | Books, small tools, packaged goods |
| Tote      | 18" × 14" × 12"        | 50 lbs     | Appliances, bulk items, equipment  |
| Carton    | 24" × 18" × 18"        | 70 lbs     | Large equipment, furniture parts   |
| Oversized | No limits              | No limits  | Items exceeding carton constraints |

The classifier always selects the smallest container that fits—a product that could fit in a Tote won't be assigned to a Carton. All three dimensions matter: a thin but long item might skip Small Bin entirely if its length exceeds the constraint.
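The selection rule is compact enough to state in code. Below is a minimal sketch, assuming dimensions are compared orientation-agnostically (longest side against longest limit, per the thin-but-long example above); the constants come from the table, and the names are illustrative rather than taken from the codebase.

# Container constraints: (length, width, height) in inches, max weight in lbs;
# None means no limit (OVERSIZED catches everything).
CONTAINERS = [
    ("POUCH",     (12, 9, 2),   1),
    ("SMALL_BIN", (12, 9, 6),   10),
    ("TOTE",      (18, 14, 12), 50),
    ("CARTON",    (24, 18, 18), 70),
    ("OVERSIZED", None,         None),
]

def select_container(dims_in, weight_lbs):
    """Return the smallest container whose constraints the item satisfies."""
    item = sorted(dims_in, reverse=True)  # longest side vs. longest limit
    for name, limits, max_weight in CONTAINERS:
        if limits is None:
            return name
        box = sorted(limits, reverse=True)
        if all(i <= b for i, b in zip(item, box)) and weight_lbs <= max_weight:
            return name

print(select_container((6, 3, 0.3), 0.49))  # POUCH
print(select_container((16, 2, 2), 0.5))    # TOTE: 16" length rules out Pouch/Small Bin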

Confidence Tiers

Each classification includes a confidence tier indicating the system's certainty:

| Tier        | Meaning             | Typical Scenarios                                  |
|-------------|---------------------|----------------------------------------------------|
| HIGH        | Strong conviction   | Explicit dimensions provided, exact database match |
| MEDIUM_HIGH | Fairly confident    | Good database match, clear product type            |
| MEDIUM      | Notable uncertainty | Estimated from product category                    |
| MEDIUM_LOW  | Significant doubt   | Vague description, ambiguous product type          |
| LOW         | Best guess          | Very limited information available                 |

Architecture

WM2 runs on AWS serverless infrastructure, with Claude handling the classification logic through an agentic loop:

flowchart TB
    subgraph Client["Client"]
        FE[S3 Static Site<br/>Frontend]
    end

    subgraph AWS["AWS Cloud"]
        APIGW[API Gateway<br/>REST + CORS + Rate Limiting]

        subgraph Lambda["Lambda Container"]
            H[Handler<br/>Input Validation]
            FB[Feedback Retrieval]
            CL[Claude Agent<br/>Haiku Model]

            subgraph Tools["Tool Execution"]
                T1[lookup_known_product]
                T2[extract_explicit_dimensions]
            end
        end

        subgraph Storage["Storage Layer"]
            DDB[(DynamoDB<br/>Feedback)]
            S3D[(S3<br/>Reference Data)]
        end

        subgraph Search["Semantic Search"]
            EMB[Embedding Service<br/>all-MiniLM-L6-v2]
            CHR[(ChromaDB<br/>Vector Index)]
        end
    end

    subgraph External["External"]
        ANTH[Anthropic API]
        ARIZE[Arize Phoenix<br/>Observability]
    end

    FE -->|POST /classify| APIGW
    APIGW --> H
    H --> FB
    FB --> DDB
    FB --> CL
    CL <-->|Messages API| ANTH
    CL -->|tool_use| T1
    CL -->|tool_use| T2
    T1 --> EMB
    EMB --> CHR
    T1 --> S3D
    CHR -.->|Index from| S3D
    CL -.->|Traces| ARIZE
    H -->|Response| APIGW
    APIGW --> FE

Request Flow:

  1. User submits a product description via the frontend
  2. API Gateway validates the request and applies rate limiting
  3. Lambda handler validates input (guardrails) and retrieves relevant feedback from DynamoDB
  4. Claude Agent receives the description with few-shot examples from feedback history
  5. Agent decides whether to invoke tools based on the input
  6. If tools are used, results are fed back to Claude for synthesis
  7. Claude returns a structured classification with confidence and reasoning
  8. Response passes through output guardrails before returning to client
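In handler form, the flow above condenses to roughly the following sketch. classify_product is a hypothetical stand-in for steps 3-7 (feedback retrieval plus the agentic loop detailed below); the guardrail thresholds it inlines are covered later in this README.

import json

def lambda_handler(event, context):
    try:
        body = json.loads(event.get("body") or "")
    except json.JSONDecodeError:
        return _response(400, {"error": "Request body must be valid JSON"})

    description = (body.get("description") or "").strip()
    if not 5 <= len(description) <= 2000:  # input guardrails (see Guardrails Pipeline)
        return _response(400, {"error": "Description must be 5-2000 characters"})

    result = classify_product(description)  # hypothetical: feedback + agentic loop
    return _response(200, result)

def _response(status, payload):
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json",
                    "Access-Control-Allow-Origin": "*"},  # CORS for the S3 frontend
        "body": json.dumps(payload),
    }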

Agentic Behavior

WM2's classifier operates as an autonomous agent that decides its own execution path. Rather than following a fixed pipeline, Claude dynamically chooses which tools to invoke (if any) based on the input description.

The Agentic Loop

flowchart TD
    START([Product Description]) --> INIT[Initialize Agent]
    INIT --> FEEDBACK[Retrieve Feedback<br/>for Few-Shot Context]
    FEEDBACK --> CLAUDE[Send to Claude<br/>with Tool Definitions]

    CLAUDE --> DECISION{Claude Response}

    DECISION -->|tool_use| TOOL_CHECK{Tool Call<br/>Limit OK?}
    TOOL_CHECK -->|Yes| EXECUTE[Execute Tool]
    TOOL_CHECK -->|No, > 3 calls| BLOCK[Block: Too Complex]

    EXECUTE --> LOOKUP{Which Tool?}
    LOOKUP -->|lookup_known_product| SEMANTIC[Semantic Search<br/>ChromaDB]
    LOOKUP -->|extract_explicit_dimensions| REGEX[Regex Parsing<br/>Dimensions/Weight]

    SEMANTIC --> RESULTS[Add Results<br/>to Messages]
    REGEX --> RESULTS
    RESULTS --> ITER_CHECK{Iteration < 5?}
    ITER_CHECK -->|Yes| CLAUDE
    ITER_CHECK -->|No| FORCE_END[Force End Turn]

    DECISION -->|end_turn| PARSE[Parse JSON Response]
    FORCE_END --> PARSE

    PARSE --> VALIDATE[Output Guardrails<br/>Validate Category/Confidence]
    VALIDATE --> RESPONSE([Classification Result])

    BLOCK --> ERROR([Error Response])

Tool Definitions

The agent has access to exactly two tools:

lookup_known_product

Searches a reference database of 479 known products using semantic similarity:

# Tool definition (simplified)
{
    "name": "lookup_known_product",
    "description": "Search the reference database for known products matching the query",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Product name or description to search for"
            }
        },
        "required": ["query"]
    }
}

How it works:

  1. Query is embedded using all-MiniLM-L6-v2 (384-dimensional vectors)
  2. ChromaDB performs approximate nearest neighbor search
  3. Results are re-ranked using hybrid scoring (semantic + keyword overlap)
  4. Top matches are returned with product name, dimensions, weight, and similarity score
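Sketched in code, that pipeline looks roughly like the following, assuming a pre-built ChromaDB collection named "products" indexed in cosine space; the 0.7/0.3 hybrid weighting and metadata field names are illustrative, not the production values.

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection("products")

def lookup_known_product(query: str, k: int = 5):
    embedding = model.encode(query).tolist()
    hits = collection.query(query_embeddings=[embedding], n_results=k)

    query_tokens = set(query.lower().split())
    results = []
    for doc, dist, meta in zip(hits["documents"][0], hits["distances"][0],
                               hits["metadatas"][0]):
        semantic = 1.0 - dist  # assumes a cosine-space index
        keyword = len(query_tokens & set(doc.lower().split())) / max(len(query_tokens), 1)
        results.append({
            "product": doc,
            "score": 0.7 * semantic + 0.3 * keyword,  # hybrid re-ranking
            "dimensions": meta.get("dimensions"),
            "weight": meta.get("weight"),
        })
    return sorted(results, key=lambda r: r["score"], reverse=True)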

extract_explicit_dimensions

Parses dimension and weight measurements from text using regex patterns:

# Tool definition (simplified)
{
    "name": "extract_explicit_dimensions",
    "description": "Extract explicit dimensions (LxWxH) and weight from text",
    "input_schema": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "Text containing dimension/weight information"
            }
        },
        "required": ["text"]
    }
}

Supported formats:

  • Dimensions: 10x8x4, 10"x8"x4", 10 x 8 x 4 inches, 10cm × 8cm × 4cm
  • Weight: 5 lbs, 2.5 kg, 16 oz, 500g
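The patterns themselves might look like the rough sketch below, which covers the formats listed above; the production regexes may differ in detail.

import re

DIM_RE = re.compile(
    r'(\d+(?:\.\d+)?)\s*(?:"|cm|in(?:ches)?)?\s*[x×]\s*'
    r'(\d+(?:\.\d+)?)\s*(?:"|cm|in(?:ches)?)?\s*[x×]\s*'
    r'(\d+(?:\.\d+)?)\s*("|cm|in(?:ches)?)?',
    re.IGNORECASE,
)
WEIGHT_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(lbs?|kg|oz|g)\b', re.IGNORECASE)
TO_LBS = {"lb": 1.0, "lbs": 1.0, "kg": 2.20462, "oz": 0.0625, "g": 0.00220462}

def extract_explicit_dimensions(text: str) -> dict:
    out = {}
    if m := DIM_RE.search(text):
        l, w, h = (float(m.group(i)) for i in (1, 2, 3))
        if (m.group(4) or "").lower() == "cm":  # normalize cm to inches
            l, w, h = (v / 2.54 for v in (l, w, h))
        out["dimensions_in"] = (l, w, h)
    if m := WEIGHT_RE.search(text):
        out["weight_lbs"] = float(m.group(1)) * TO_LBS[m.group(2).lower()]
    return out

print(extract_explicit_dimensions("Box, 10x8x4 inches, 5 lbs"))
# {'dimensions_in': (10.0, 8.0, 4.0), 'weight_lbs': 5.0}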

When Tools Are Invoked

The agent autonomously decides whether to use tools based on context:

| Scenario | Typical Agent Behavior |
|----------|------------------------|
| "iPhone 15 Pro Max" | Calls lookup_known_product → likely finds exact match |
| "Box, 10x8x4 inches, 5 lbs" | Calls extract_explicit_dimensions → uses explicit measurements |
| "Small USB cable" | May call lookup_known_product, or classify directly as POUCH |
| "Industrial widget, model X-500" | Calls lookup_known_product → may find similar products |
| "24x18x16 shipping container, 45 lbs" | Calls extract_explicit_dimensions → CARTON based on dimensions |

Most classifications complete in 0-2 tool calls. The behavioral guardrails limit tool calls to 3 per request to prevent pathological looping.
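Conceptually, the loop reduces to the following sketch against the Anthropic Messages API. The limits mirror the guardrails described in this README; run_tool is a hypothetical dispatcher to the two tools above, and the exact Haiku model id is an assumption.

import anthropic

client = anthropic.Anthropic()
MAX_TOOL_CALLS, MAX_ITERATIONS = 3, 5

def classify(description: str, system_prompt: str, tools: list) -> str:
    messages = [{"role": "user", "content": description}]
    tool_calls = 0
    for _ in range(MAX_ITERATIONS):
        response = client.messages.create(
            model="claude-3-5-haiku-latest",  # assumption: exact model id not documented
            max_tokens=1024,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text  # end_turn: JSON classification to parse

        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                tool_calls += 1
                if tool_calls > MAX_TOOL_CALLS:
                    raise RuntimeError("Blocked: too complex (tool call limit)")
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": run_tool(block.name, block.input)})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("Iteration limit reached")  # production forces an end turn instead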


Feedback Memory System

WM2 learns from user corrections through a feedback loop that stores thumbs up/down responses and retrieves them as few-shot examples for future classifications.

flowchart LR
    subgraph Classification["Classification Request"]
        DESC[Product Description]
        KW[Extract Keywords]
    end

    subgraph Retrieval["Two-Tier Retrieval"]
        RECENT[Recent Tier<br/>Last 10 entries]
        KEYWORD[Keyword Tier<br/>Semantic relevance]
    end

    subgraph Context["Prompt Context"]
        DEDUP[Deduplicate]
        FORMAT[Format as<br/>Few-Shot Examples]
    end

    subgraph Storage["DynamoDB"]
        DDB[(Feedback Table)]
    end

    DESC --> KW
    KW --> KEYWORD
    DDB --> RECENT
    DDB --> KEYWORD
    RECENT --> DEDUP
    KEYWORD --> DEDUP
    DEDUP --> FORMAT
    FORMAT --> CLAUDE[Claude Agent]

    CLAUDE -->|Classification| USER[User]
    USER -->|"👍 / 👎"| STORE[Store Feedback]
    STORE --> DDB

How It Works

Storage (on feedback submission):

  1. User submits thumbs up (correct) or thumbs down (incorrect) for a classification
  2. System extracts keywords from the product description
  3. Feedback is stored in DynamoDB with: description, classification, correctness, keywords, timestamp

Retrieval (on classification request):

  1. Recency tier: Fetches the 10 most recent feedback entries (captures recent corrections)
  2. Keyword tier: Finds entries with overlapping keywords (captures domain-relevant examples)
  3. Results are deduplicated and formatted as few-shot examples in the system prompt

Prompt injection format:

## User Feedback History

The following are previous classifications with user feedback:

✓ CORRECT: "iPhone 15 Pro case" → POUCH
✗ INCORRECT: "Large toolbox set" → SMALL_BIN (user indicated this was wrong)
✓ CORRECT: "Hydraulic pump, 12x10x8, 25 lbs" → TOTE

This approach allows the model to learn from corrections without retraining, adapting its behavior based on accumulated feedback.
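In code, the retrieval and formatting might look like this sketch; fetch_recent and fetch_by_keywords are hypothetical stand-ins for the DynamoDB queries, and keyword extraction is reduced to simple tokenization.

def build_feedback_context(description: str) -> str:
    keywords = set(description.lower().split())

    recent = fetch_recent(limit=10)           # tier 1: the 10 most recent entries
    by_keyword = fetch_by_keywords(keywords)  # tier 2: keyword-overlap matches

    seen, lines = set(), []
    for entry in recent + by_keyword:         # deduplicate across tiers
        if entry["description"] in seen:
            continue
        seen.add(entry["description"])
        if entry["is_correct"]:
            lines.append(f'✓ CORRECT: "{entry["description"]}" → {entry["classification"]}')
        else:
            lines.append(f'✗ INCORRECT: "{entry["description"]}" → '
                         f'{entry["classification"]} (user indicated this was wrong)')

    return ("## User Feedback History\n\n"
            "The following are previous classifications with user feedback:\n\n"
            + "\n".join(lines))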


Guardrails Pipeline

WM2 implements a three-stage guardrails architecture following the GUARD (Generalized Unified Agent Risk Defense) framework. Guardrails operate at runtime to prevent harm and waste, complementing the offline evaluation system that measures quality.

flowchart LR
    subgraph Input["Input Stage"]
        I1[Length Check<br/>5-2000 chars]
        I2[Empty/Whitespace<br/>Validation]
        I3[JSON Parse<br/>Validation]
        I4[Rate Limiting<br/>30 req/sec]
    end

    subgraph Behavioral["Behavioral Stage"]
        B1[Tool Call Limit<br/>Max 3 calls]
        B2[Iteration Limit<br/>Max 5 loops]
        B3[Request Timeout<br/>25 seconds]
        B4[Fixed Tool Set<br/>2 tools only]
    end

    subgraph Output["Output Stage"]
        O1[Category Validation<br/>Must be valid]
        O2[Confidence Validation<br/>Must be valid tier]
        O3[Reasoning Truncation<br/>Max 500 chars]
    end

    REQUEST([Request]) --> Input
    Input -->|Pass| Behavioral
    Input -->|Fail| BLOCK1([400 Error])

    Behavioral -->|Pass| CLAUDE[Claude Agent]
    Behavioral -->|Fail| BLOCK2([Error Response])

    CLAUDE --> Output
    Output -->|Pass| RESPONSE([Classification])
    Output -->|Fail| BLOCK3([500 Error])

Input Stage Guardrails

Protect the system before Claude sees the request:

| Guardrail | Detection | Response | Threshold |
|-----------|-----------|----------|-----------|
| Minimum description length | Deterministic | Block (400) | < 5 characters |
| Maximum description length | Deterministic | Block (400) | > 2000 characters |
| Empty/whitespace check | Deterministic | Block (400) | Empty or whitespace-only |
| JSON parse validation | Deterministic | Block (400) | Invalid JSON body |
| API Gateway rate limiting | Deterministic | Block (429) | > 30 req/sec, 50 burst |
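A minimal sketch of these deterministic checks follows; rate limiting is enforced upstream in API Gateway configuration rather than in application code.

MIN_LEN, MAX_LEN = 5, 2000

def validate_description(description) -> str | None:
    """Return an error message for a 400 response, or None if the input passes."""
    if not isinstance(description, str) or not description.strip():
        return "Description must be a non-empty string"
    if len(description) < MIN_LEN:
        return f"Description too short (minimum {MIN_LEN} characters)"
    if len(description) > MAX_LEN:
        return f"Description too long (maximum {MAX_LEN} characters)"
    return None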

Behavioral Stage Guardrails

Prevent runaway execution during the agentic loop:

| Guardrail | Detection | Response | Threshold |
|-----------|-----------|----------|-----------|
| Tool call limit | Counter | Block with error | > 3 tool calls per request |
| Iteration limit | Counter | Force end turn | > 5 agentic loop iterations |
| Request timeout | Timer | Graceful error | > 25 seconds (API client) |
| Lambda timeout | AWS | Hard termination | > 30 seconds |
| Fixed tool set | Allowlist | Error result | Unknown tool name |

Output Stage Guardrails

Validate Claude's response before returning to the client:

| Guardrail | Detection | Response | Threshold |
|-----------|-----------|----------|-----------|
| Valid category check | Enum validation | Block (raises error) | Category not in valid set |
| Valid confidence tier | Enum validation | Block (raises error) | Tier not in valid set |
| Reasoning truncation | Length check | Truncate with "..." | > 500 characters |
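As a sketch: enum failures raise (surfacing as a 500), while over-long reasoning is repaired by truncation rather than rejected. The category and tier sets come from the tables earlier in this README.

VALID_CATEGORIES = {"POUCH", "SMALL_BIN", "TOTE", "CARTON", "OVERSIZED"}
VALID_TIERS = {"HIGH", "MEDIUM_HIGH", "MEDIUM", "MEDIUM_LOW", "LOW"}
MAX_REASONING = 500

def validate_output(result: dict) -> dict:
    if result.get("classification") not in VALID_CATEGORIES:
        raise ValueError(f"Invalid category: {result.get('classification')!r}")
    if result.get("confidence") not in VALID_TIERS:
        raise ValueError(f"Invalid confidence tier: {result.get('confidence')!r}")
    reasoning = result.get("reasoning", "")
    if len(reasoning) > MAX_REASONING:
        result["reasoning"] = reasoning[:MAX_REASONING - 3] + "..."
    return result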

Structural Guardrails

Beyond runtime checks, the architecture provides inherent protections:

  • Fixed tool set: The agent can only use 2 pre-defined tools—no arbitrary actions possible
  • Single model: Hardcoded to Claude Haiku—no model escalation, predictable cost
  • Lambda isolation: No network egress except to Anthropic API—no data exfiltration
  • Structured output: JSON extraction with fallback parsing—consistent response format

Evaluation Framework

WM2 includes a comprehensive local evaluation system for measuring classifier quality. The evaluation framework operates offline (batch analysis) to complement the runtime guardrails.

flowchart TB
    subgraph Dataset["Test Dataset"]
        CSV[wm2_eval_v1.csv<br/>60 labeled cases]
    end

    subgraph Runner["Evaluation Runner"]
        LOAD[Load Dataset]
        CLASSIFY[Run Classifier]
        EVALUATE[Apply Evaluators]
        AGGREGATE[Compute Metrics]
    end

    subgraph Evaluators["8 APF Evaluators"]
        subgraph Gates["Layer 1: Gates"]
            E1[valid_category]
            E2[has_reasoning]
            E3[valid_confidence_tier]
        end

        subgraph Pillars["Layer 2: Pillars"]
            E4[fit_accuracy<br/>Effectiveness]
            E5[strict_accuracy<br/>Effectiveness]
            E6[latency_acceptable<br/>Efficiency]
            E7[overconfident_failure<br/>Reliability]
            E8[safety_weight_check<br/>Trustworthiness]
        end
    end

    subgraph Storage["SQLite Storage"]
        DB[(eval_results.db)]
        SCHEMA[EvalRun + EvalResult<br/>+ EvaluatorResult]
    end

    subgraph Presentation["Visualization"]
        CLI[CLI Commands]
        DASH[Streamlit Dashboard]
    end

    CSV --> LOAD
    LOAD --> CLASSIFY
    CLASSIFY --> EVALUATE
    EVALUATE --> Gates
    EVALUATE --> Pillars
    Gates --> AGGREGATE
    Pillars --> AGGREGATE
    AGGREGATE --> DB
    DB --> CLI
    DB --> DASH

Evaluation Philosophy

The evaluation system follows the Agent Performance Framework (APF) with three layers:

Layer 1 - Binary Gates (must-pass):

  • valid_category: Output must be one of the 5 valid categories
  • has_reasoning: Response must include non-empty reasoning
  • valid_confidence_tier: Confidence must be a valid tier

Layer 2 - Pillar Metrics:

  • Effectiveness: fit_accuracy (asymmetric—over-prediction acceptable), strict_accuracy (exact match)
  • Efficiency: latency_acceptable (< 5 seconds)
  • Reliability: overconfident_failure (penalizes HIGH confidence wrong answers)
  • Trustworthiness: safety_weight_check (weight within category limits)
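Three of these evaluators can be sketched directly from the definitions above; the capacity ordering is the one implied by the container table. (A sketch only: the actual evaluator interfaces may differ.)

SIZE_ORDER = ["POUCH", "SMALL_BIN", "TOTE", "CARTON", "OVERSIZED"]

def fit_accuracy(predicted: str, expected: str) -> bool:
    """Asymmetric: a container at least as large as the label passes, since the
    product still physically fits (at some storage-efficiency cost)."""
    return SIZE_ORDER.index(predicted) >= SIZE_ORDER.index(expected)

def strict_accuracy(predicted: str, expected: str) -> bool:
    return predicted == expected

def overconfident_failure(predicted: str, expected: str, confidence: str) -> bool:
    """Reliability: flags cases that were both wrong and maximally confident."""
    return confidence == "HIGH" and predicted != expected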

Provenance Tracking

Every evaluation run captures:

  • Git commit hash and branch name
  • Timestamp
  • Dataset name and version
  • Per-case results with all evaluator scores
  • Aggregate summary metrics

This enables questions like "What changed between these two runs?" and "Which commit introduced this regression?"
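Capturing that metadata takes only a few lines; the record shape below is illustrative rather than the actual EvalRun schema.

import subprocess
from datetime import datetime, timezone

def capture_provenance(dataset: str, version: str) -> dict:
    def git(*args):
        return subprocess.check_output(("git", *args), text=True).strip()
    return {
        "commit": git("rev-parse", "HEAD"),
        "branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "dataset_version": version,
    }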

CLI Usage

# Run an evaluation
python -m eval.cli run --name "baseline-v1"

# View summary of latest run
python -m eval.cli summary

# List recent runs
python -m eval.cli list --last 10

# Compare two runs
python -m eval.cli compare <run-id-1> <run-id-2>

Streamlit Dashboard

The dashboard provides interactive visualization of evaluation results:

streamlit run dashboard/app.py

Pages:

  • Overview: Run summary with all evaluator scores and visual indicators
  • Failures: Drill-down into failed cases with filtering by evaluator and category
  • Compare: Side-by-side run comparison with improvement/regression highlighting
  • History: Historical trends with interactive charts

Arize Phoenix Integration

Production traces are sent to Arize Phoenix for observability:

  • Request/response tracing for all Claude API calls
  • Latency metrics and token usage
  • Tool call patterns and outcomes

Tracing is fail-open—if credentials are missing, requests proceed without observability.
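The fail-open pattern amounts to wrapping tracer setup in a broad try/except at cold start. A sketch, assuming the arize-phoenix-otel register entry point and a PHOENIX_API_KEY credential check (both assumptions about this deployment's configuration):

import os

tracer_provider = None
try:
    if os.environ.get("PHOENIX_API_KEY"):
        from phoenix.otel import register
        tracer_provider = register(project_name="wm2")
except Exception:
    tracer_provider = None  # fail open: classify without observability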


Development Setup

Installation

Clone the repository and install in development mode:

git clone https://github.com/EvieHwang/wm2.git
cd wm2
pip install -e .

This installs the project in editable mode, making all imports work correctly without PYTHONPATH manipulation.

Optional Dependencies

For semantic search capabilities (local development):

pip install -e ".[semantic]"

For development tools (testing, linting):

pip install -e ".[dev]"

Running Tests

# Run all tests
python -m pytest

# Run only eval tests
python -m pytest eval/tests/

# Run only backend tests
python -m pytest backend/tests/

# Run with coverage
python -m pytest --cov=backend --cov=eval

Running the Evaluation CLI

The commands are identical to those shown under CLI Usage in the Evaluation Framework section above.

Running the Dashboard

streamlit run dashboard/app.py
# Opens browser to http://localhost:8501

API Reference

POST /classify

Classify a product description into an ASRS container type.

Request:

{
  "description": "iPhone 15 Pro Max, 6.7 inch display, 221g"
}

Response:

{
  "classification": "POUCH",
  "confidence": "HIGH",
  "reasoning": "The iPhone 15 Pro Max is a smartphone with known dimensions approximately 6.3 x 3.0 x 0.3 inches and weight of 0.49 lbs, which fits comfortably within Pouch constraints.",
  "tools_used": {
    "lookup_known_product": {
      "called": true,
      "result": "Found iPhone 15 Pro Max in reference database"
    },
    "extract_explicit_dimensions": {
      "called": false
    }
  },
  "extracted_measurements": {
    "length": 6.3,
    "width": 3.0,
    "height": 0.3,
    "weight": 0.49,
    "source": "reference"
  },
  "latency_ms": 1247
}
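For reference, a minimal client-side call; the placeholder URL must be replaced with the deployed API Gateway endpoint, and the requests package is assumed.

import requests

API_BASE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/Prod"  # placeholder

resp = requests.post(
    f"{API_BASE_URL}/classify",
    json={"description": "iPhone 15 Pro Max, 6.7 inch display, 221g"},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()
print(result["classification"], result["confidence"])  # e.g. POUCH HIGH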

POST /v1/feedback

Submit feedback on a classification result.

Request:

{
  "description": "iPhone 15 Pro Max, 6.7 inch display, 221g",
  "classification": "POUCH",
  "is_correct": true
}

Response:

{
  "status": "success",
  "message": "Feedback recorded"
}

GET /health

Health check endpoint.

Response:

{
  "status": "healthy"
}

Deployment

CI/CD Pipeline

This project uses GitHub Actions for continuous integration and deployment:

  • CI: Runs on all PRs and pushes to main (lint, test, security scan, dependency audit)
  • Deploy: Automatically deploys to AWS on merge to main

Required GitHub Secrets

| Secret | Description |
|--------|-------------|
| AWS_ACCESS_KEY_ID | AWS IAM access key with Lambda, ECR, S3, CodeBuild permissions |
| AWS_SECRET_ACCESS_KEY | AWS IAM secret key |
| ANTHROPIC_API_KEY | Anthropic API key for running tests |

Manual Deployment

Prerequisites: AWS CLI configured, SAM CLI installed, Anthropic API key

# Backend
cd backend
sam build
sam deploy --guided --parameter-overrides AnthropicApiKey=your-key

# Frontend (update API_BASE_URL in app.js first)
aws s3 mb s3://your-bucket-name
aws s3 website s3://your-bucket-name --index-document index.html
aws s3 sync ../frontend s3://your-bucket-name

Tech Stack

| Component | Technology |
|-----------|------------|
| Runtime | Python 3.12, AWS Lambda |
| API | API Gateway (REST), CORS enabled |
| AI/ML | Claude API (Anthropic), sentence-transformers |
| Storage | DynamoDB (feedback), S3 (reference data, frontend) |
| Search | ChromaDB (vector embeddings), all-MiniLM-L6-v2 |
| Deployment | AWS SAM, GitHub Actions |
| Evaluation | SQLite, Streamlit, Arize Phoenix |
| Frontend | Vanilla HTML/CSS/JavaScript |

License

MIT License - See LICENSE for details.
