An AI-powered product classification system that automatically assigns warehouse storage categories based on text descriptions. WM2 combines Claude's reasoning capabilities with tool-augmented retrieval to classify products into optimal ASRS (Automated Storage and Retrieval System) container types.
- What It Does
- Classification Categories
- Architecture
- Agentic Behavior
- Feedback Memory System
- Guardrails Pipeline
- Evaluation Framework
- Development Setup
- API Reference
- Deployment
- Tech Stack
WM2 solves the problem of automatically categorizing warehouse products into the right storage containers. Given a product description—anything from "USB-C cable, 6ft" to "Industrial hydraulic pump, 45 lbs, 18x12x14 inches"—the system determines which ASRS container type can safely and efficiently hold that product.
The classifier operates as an agentic system: rather than making a single inference call, it can autonomously decide to use tools to gather additional information before making its classification decision. This tool-augmented approach allows it to:
- Look up known products in a reference database of 479 items using semantic search
- Extract explicit dimensions from descriptions using regex pattern matching
- Apply constraint-based rules to select the smallest container that fits
Each classification includes a confidence tier and reasoning, making the system's decision-making transparent and auditable.
Products are classified into one of five container types, each with specific dimension and weight constraints:
| Category | Max Dimensions (L×W×H) | Max Weight | Typical Products |
|---|---|---|---|
| Pouch | 12" × 9" × 2" | 1 lb | Small electronics, cables, jewelry |
| Small Bin | 12" × 9" × 6" | 10 lbs | Books, small tools, packaged goods |
| Tote | 18" × 14" × 12" | 50 lbs | Appliances, bulk items, equipment |
| Carton | 24" × 18" × 18" | 70 lbs | Large equipment, furniture parts |
| Oversized | No limits | No limits | Items exceeding carton constraints |
The classifier always selects the smallest container that fits—a product that could fit in a Tote won't be assigned to a Carton. All three dimensions matter: a thin but long item might skip Small Bin entirely if its length exceeds the constraint.
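In code, the smallest-fits rule reduces to an ordered constraint check against the limits in the table above. The sketch below is illustrative only; the category names, dimension sorting, and data layout are assumptions, not the repository's actual implementation:

```python
# Container limits from the table above, ordered smallest to largest.
# Names and structure are illustrative, not the production data model.
CONTAINERS = [
    ("POUCH",     (12, 9, 2),   1),
    ("SMALL_BIN", (12, 9, 6),   10),
    ("TOTE",      (18, 14, 12), 50),
    ("CARTON",    (24, 18, 18), 70),
]

def smallest_fitting_container(dims_in, weight_lbs):
    """Return the smallest category whose limits contain the product.

    dims_in: (length, width, height) in inches. Both the product and the
    container limits are sorted so a thin-but-long item is checked against
    each container's longest side.
    """
    product = sorted(dims_in, reverse=True)
    for name, limits, max_weight in CONTAINERS:
        box = sorted(limits, reverse=True)
        if weight_lbs <= max_weight and all(p <= b for p, b in zip(product, box)):
            return name
    return "OVERSIZED"

# A 16" x 10" x 3", 4 lb item skips Pouch and Small Bin on length and lands in Tote.
assert smallest_fitting_container((16, 10, 3), 4) == "TOTE"
```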
Each classification includes a confidence tier indicating the system's certainty:
| Tier | Meaning | Typical Scenarios |
|---|---|---|
| HIGH | Strong conviction | Explicit dimensions provided, exact database match |
| MEDIUM_HIGH | Fairly confident | Good database match, clear product type |
| MEDIUM | Notable uncertainty | Estimated from product category |
| MEDIUM_LOW | Significant doubt | Vague description, ambiguous product type |
| LOW | Best guess | Very limited information available |
WM2 runs on AWS serverless infrastructure, with Claude handling the classification logic through an agentic loop:
```mermaid
flowchart TB
subgraph Client["Client"]
FE[S3 Static Site<br/>Frontend]
end
subgraph AWS["AWS Cloud"]
APIGW[API Gateway<br/>REST + CORS + Rate Limiting]
subgraph Lambda["Lambda Container"]
H[Handler<br/>Input Validation]
FB[Feedback Retrieval]
CL[Claude Agent<br/>Haiku Model]
subgraph Tools["Tool Execution"]
T1[lookup_known_product]
T2[extract_explicit_dimensions]
end
end
subgraph Storage["Storage Layer"]
DDB[(DynamoDB<br/>Feedback)]
S3D[(S3<br/>Reference Data)]
end
subgraph Search["Semantic Search"]
EMB[Embedding Service<br/>all-MiniLM-L6-v2]
CHR[(ChromaDB<br/>Vector Index)]
end
end
subgraph External["External"]
ANTH[Anthropic API]
ARIZE[Arize Phoenix<br/>Observability]
end
FE -->|POST /classify| APIGW
APIGW --> H
H --> FB
FB --> DDB
FB --> CL
CL <-->|Messages API| ANTH
CL -->|tool_use| T1
CL -->|tool_use| T2
T1 --> EMB
EMB --> CHR
T1 --> S3D
CHR -.->|Index from| S3D
CL -.->|Traces| ARIZE
H -->|Response| APIGW
APIGW --> FE
```
Request Flow (a handler-level sketch in Python follows the list):
- User submits a product description via the frontend
- API Gateway validates the request and applies rate limiting
- Lambda handler validates input (guardrails) and retrieves relevant feedback from DynamoDB
- Claude Agent receives the description with few-shot examples from feedback history
- Agent decides whether to invoke tools based on the input
- If tools are used, results are fed back to Claude for synthesis
- Claude returns a structured classification with confidence and reasoning
- Response passes through output guardrails before returning to client
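The same flow, reduced to a handler-level outline. This is a hedged sketch only: the helper names (retrieve_feedback, run_agent, validate_output) stand in for the modules described in later sections and do not reflect the repository's actual signatures.

```python
import json

def lambda_handler(event, context):
    """Illustrative outline of the request flow, not the actual handler."""
    # 1. Input guardrails: parse the JSON body and validate the description.
    try:
        body = json.loads(event.get("body") or "")
        description = body["description"].strip()
        if not (5 <= len(description) <= 2000):
            raise ValueError("description must be 5-2000 characters")
    except (json.JSONDecodeError, KeyError, ValueError, AttributeError) as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

    # 2. Retrieve stored feedback and format it as few-shot context.
    few_shot = retrieve_feedback(description)          # hypothetical helper

    # 3. Agentic loop: Claude decides whether to call tools, then answers.
    result = run_agent(description, few_shot)          # hypothetical helper

    # 4. Output guardrails, then return the structured classification.
    return {"statusCode": 200, "body": json.dumps(validate_output(result))}
```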
WM2's classifier operates as an autonomous agent that decides its own execution path. Rather than following a fixed pipeline, Claude dynamically chooses which tools to invoke (if any) based on the input description.
```mermaid
flowchart TD
START([Product Description]) --> INIT[Initialize Agent]
INIT --> FEEDBACK[Retrieve Feedback<br/>for Few-Shot Context]
FEEDBACK --> CLAUDE[Send to Claude<br/>with Tool Definitions]
CLAUDE --> DECISION{Claude Response}
DECISION -->|tool_use| TOOL_CHECK{Tool Call<br/>Limit OK?}
TOOL_CHECK -->|Yes| EXECUTE[Execute Tool]
TOOL_CHECK -->|No, > 3 calls| BLOCK[Block: Too Complex]
EXECUTE --> LOOKUP{Which Tool?}
LOOKUP -->|lookup_known_product| SEMANTIC[Semantic Search<br/>ChromaDB]
LOOKUP -->|extract_explicit_dimensions| REGEX[Regex Parsing<br/>Dimensions/Weight]
SEMANTIC --> RESULTS[Add Results<br/>to Messages]
REGEX --> RESULTS
RESULTS --> ITER_CHECK{Iteration < 5?}
ITER_CHECK -->|Yes| CLAUDE
ITER_CHECK -->|No| FORCE_END[Force End Turn]
DECISION -->|end_turn| PARSE[Parse JSON Response]
FORCE_END --> PARSE
PARSE --> VALIDATE[Output Guardrails<br/>Validate Category/Confidence]
VALIDATE --> RESPONSE([Classification Result])
BLOCK --> ERROR([Error Response])
```
The agent has access to exactly two tools:
Tool 1, `lookup_known_product`, searches a reference database of 479 known products using semantic similarity:
```python
# Tool definition (simplified)
{
"name": "lookup_known_product",
"description": "Search the reference database for known products matching the query",
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Product name or description to search for"
}
},
"required": ["query"]
}
}
```

How it works:
- The query is embedded using `all-MiniLM-L6-v2` (384-dimensional vectors)
- ChromaDB performs approximate nearest-neighbor search
- Results are re-ranked using hybrid scoring (semantic + keyword overlap)
- Top matches are returned with product name, dimensions, weight, and similarity score
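A minimal sketch of the re-ranking step, assuming ChromaDB results arrive as (name, similarity) pairs with similarity already normalized to 0..1; the 0.7/0.3 weighting and the helper names are assumptions, not the actual scoring code:

```python
def keyword_overlap(query: str, candidate: str) -> float:
    """Fraction of query tokens that also appear in the candidate product name."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank(query, chroma_results, semantic_weight=0.7):
    """Blend vector similarity with keyword overlap and sort best-first."""
    scored = [
        (name, semantic_weight * sim + (1 - semantic_weight) * keyword_overlap(query, name))
        for name, sim in chroma_results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```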
Tool 2, `extract_explicit_dimensions`, parses dimension and weight measurements from text using regex patterns:
```python
# Tool definition (simplified)
{
"name": "extract_explicit_dimensions",
"description": "Extract explicit dimensions (LxWxH) and weight from text",
"input_schema": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "Text containing dimension/weight information"
}
},
"required": ["text"]
}
}
```

Supported formats:
- Dimensions: `10x8x4`, `10"x8"x4"`, `10 x 8 x 4 inches`, `10cm × 8cm × 4cm`
- Weight: `5 lbs`, `2.5 kg`, `16 oz`, `500g`
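A simplified version of the parsing might look like the following; the repository's actual patterns and unit handling are likely more thorough (for example, this sketch does not convert centimeter dimensions):

```python
import re

# L x W x H with optional quotes/units, e.g. 10x8x4, 10"x8"x4", 10 x 8 x 4 inches
DIMS_RE = re.compile(
    r'(\d+(?:\.\d+)?)\s*(?:"|in(?:ches)?|cm)?\s*[x×]\s*'
    r'(\d+(?:\.\d+)?)\s*(?:"|in(?:ches)?|cm)?\s*[x×]\s*'
    r'(\d+(?:\.\d+)?)\s*(?:"|in(?:ches)?|cm)?',
    re.IGNORECASE,
)

# Weight with unit, e.g. 5 lbs, 2.5 kg, 16 oz, 500g
WEIGHT_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(lbs?|kg|oz|g)\b', re.IGNORECASE)

def extract_explicit_dimensions(text: str) -> dict:
    """Return explicit dimensions and weight (converted to pounds) found in text."""
    result = {}
    if m := DIMS_RE.search(text):
        result["dimensions"] = tuple(float(v) for v in m.groups())
    if m := WEIGHT_RE.search(text):
        value, unit = float(m.group(1)), m.group(2).lower()
        to_lbs = {"lb": 1, "lbs": 1, "kg": 2.20462, "oz": 1 / 16, "g": 0.00220462}
        result["weight_lbs"] = round(value * to_lbs[unit], 3)
    return result

print(extract_explicit_dimensions('Box, 10x8x4 inches, 5 lbs'))
# {'dimensions': (10.0, 8.0, 4.0), 'weight_lbs': 5.0}
```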
The agent autonomously decides whether to use tools based on context:
| Scenario | Typical Agent Behavior |
|---|---|
| "iPhone 15 Pro Max" | Calls lookup_known_product → likely finds exact match |
| "Box, 10x8x4 inches, 5 lbs" | Calls extract_explicit_dimensions → uses explicit measurements |
| "Small USB cable" | May call lookup_known_product, or classify directly as POUCH |
| "Industrial widget, model X-500" | Calls lookup_known_product → may find similar products |
| "24x18x16 shipping container, 45 lbs" | Calls extract_explicit_dimensions → CARTON based on dimensions |
Most classifications complete in 0-2 tool calls. The behavioral guardrails limit tool calls to 3 per request to prevent pathological looping.
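Putting the pieces together, the loop roughly follows the diagram above. This sketch uses the Anthropic Python SDK's Messages API; the model id, prompt assembly, and JSON extraction shown here are assumptions rather than verified details of the repository:

```python
import json
import anthropic

MAX_TOOL_CALLS = 3   # behavioral guardrail: block the request as too complex
MAX_ITERATIONS = 5   # behavioral guardrail: force an end of turn

def run_agent(description, system_prompt, tools, execute_tool):
    """Let Claude call tools until it ends its turn or a limit trips.

    `tools` is a list of tool definitions like the ones above;
    `execute_tool(name, args)` dispatches to the lookup/regex implementations.
    """
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": description}]
    tool_calls = 0

    for _ in range(MAX_ITERATIONS):
        response = client.messages.create(
            model="claude-3-5-haiku-latest",   # "Haiku"; exact model id is an assumption
            max_tokens=1024,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            break                              # end_turn: Claude has answered

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        tool_calls += len(tool_uses)
        if tool_calls > MAX_TOOL_CALLS:
            raise RuntimeError("Blocked: tool call limit exceeded")

        # Feed tool results back to Claude and iterate.
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": json.dumps(execute_tool(b.name, b.input))}
            for b in tool_uses
        ]})

    # Parse the final JSON classification from the last text block (fallback handling omitted).
    text = next((b.text for b in response.content if b.type == "text"), "{}")
    return json.loads(text)
```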
WM2 learns from user corrections through a feedback loop that stores thumbs up/down responses and retrieves them as few-shot examples for future classifications.
```mermaid
flowchart LR
subgraph Classification["Classification Request"]
DESC[Product Description]
KW[Extract Keywords]
end
subgraph Retrieval["Two-Tier Retrieval"]
RECENT[Recent Tier<br/>Last 10 entries]
KEYWORD[Keyword Tier<br/>Semantic relevance]
end
subgraph Context["Prompt Context"]
DEDUP[Deduplicate]
FORMAT[Format as<br/>Few-Shot Examples]
end
subgraph Storage["DynamoDB"]
DDB[(Feedback Table)]
end
DESC --> KW
KW --> KEYWORD
DDB --> RECENT
DDB --> KEYWORD
RECENT --> DEDUP
KEYWORD --> DEDUP
DEDUP --> FORMAT
FORMAT --> CLAUDE[Claude Agent]
CLAUDE -->|Classification| USER[User]
USER -->|"👍 / 👎"| STORE[Store Feedback]
STORE --> DDB
```
Storage (on feedback submission):
- User submits thumbs up (correct) or thumbs down (incorrect) for a classification
- System extracts keywords from the product description
- Feedback is stored in DynamoDB with: description, classification, correctness, keywords, timestamp
Retrieval (on classification request):
- Recency tier: Fetches the 10 most recent feedback entries (captures recent corrections)
- Keyword tier: Finds entries with overlapping keywords (captures domain-relevant examples)
- Results are deduplicated and formatted as few-shot examples in the system prompt
Prompt injection format:
```
## User Feedback History
The following are previous classifications with user feedback:
✓ CORRECT: "iPhone 15 Pro case" → POUCH
✗ INCORRECT: "Large toolbox set" → SMALL_BIN (user indicated this was wrong)
✓ CORRECT: "Hydraulic pump, 12x10x8, 25 lbs" → TOTE
```
This approach allows the model to learn from corrections without retraining, adapting its behavior based on accumulated feedback.
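A hedged sketch of the retrieval-and-formatting step, assuming a boto3 DynamoDB table with description, classification, is_correct, keywords, and timestamp attributes; the table name, scan-based access, and keyword matching are assumptions rather than the repository's actual query patterns:

```python
import boto3

def retrieve_feedback(description, table_name="wm2-feedback", recent_n=10):
    """Fetch recent + keyword-relevant feedback and format it as few-shot text."""
    table = boto3.resource("dynamodb").Table(table_name)   # table name is hypothetical
    items = table.scan().get("Items", [])                  # acceptable for a small table

    # Tier 1: the most recent entries, regardless of topic.
    recent = sorted(items, key=lambda i: i["timestamp"], reverse=True)[:recent_n]

    # Tier 2: entries whose stored keywords overlap with the new description.
    query_words = set(description.lower().split())
    relevant = [i for i in items if query_words & set(i.get("keywords", []))]

    # Deduplicate (recency tier wins) and format as few-shot lines.
    seen, lines = set(), []
    for item in recent + relevant:
        if item["description"] in seen:
            continue
        seen.add(item["description"])
        mark = "✓ CORRECT" if item["is_correct"] else "✗ INCORRECT"
        lines.append(f'{mark}: "{item["description"]}" → {item["classification"]}')
    return "## User Feedback History\n" + "\n".join(lines)
```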
WM2 implements a three-stage guardrails architecture following the GUARD (Generalized Unified Agent Risk Defense) framework. Guardrails operate at runtime to prevent harm and waste, complementing the offline evaluation system that measures quality.
```mermaid
flowchart LR
subgraph Input["Input Stage"]
I1[Length Check<br/>5-2000 chars]
I2[Empty/Whitespace<br/>Validation]
I3[JSON Parse<br/>Validation]
I4[Rate Limiting<br/>30 req/sec]
end
subgraph Behavioral["Behavioral Stage"]
B1[Tool Call Limit<br/>Max 3 calls]
B2[Iteration Limit<br/>Max 5 loops]
B3[Request Timeout<br/>25 seconds]
B4[Fixed Tool Set<br/>2 tools only]
end
subgraph Output["Output Stage"]
O1[Category Validation<br/>Must be valid]
O2[Confidence Validation<br/>Must be valid tier]
O3[Reasoning Truncation<br/>Max 500 chars]
end
REQUEST([Request]) --> Input
Input -->|Pass| Behavioral
Input -->|Fail| BLOCK1([400 Error])
Behavioral -->|Pass| CLAUDE[Claude Agent]
Behavioral -->|Fail| BLOCK2([Error Response])
CLAUDE --> Output
Output -->|Pass| RESPONSE([Classification])
Output -->|Fail| BLOCK3([500 Error])
```
Protect the system before Claude sees the request (a validation sketch follows the table):
| Guardrail | Detection | Response | Threshold |
|---|---|---|---|
| Minimum description length | Deterministic | Block (400) | < 5 characters |
| Maximum description length | Deterministic | Block (400) | > 2000 characters |
| Empty/whitespace check | Deterministic | Block (400) | Empty or whitespace-only |
| JSON parse validation | Deterministic | Block (400) | Invalid JSON body |
| API Gateway rate limiting | Deterministic | Block (429) | > 30 req/sec, 50 burst |
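Rate limiting is enforced at API Gateway; the remaining input checks run inside the Lambda handler. A minimal sketch (the length bounds match the table; error messages and structure are illustrative):

```python
import json

MIN_LEN, MAX_LEN = 5, 2000

def validate_request(raw_body: str) -> str:
    """Apply input guardrails; return the cleaned description or raise ValueError (-> 400)."""
    try:
        body = json.loads(raw_body or "")
    except json.JSONDecodeError:
        raise ValueError("Request body is not valid JSON")

    description = str(body.get("description", "")).strip()
    if not description:
        raise ValueError("Description is empty or whitespace-only")
    if not (MIN_LEN <= len(description) <= MAX_LEN):
        raise ValueError(f"Description must be {MIN_LEN}-{MAX_LEN} characters")
    return description
```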
Prevent runaway execution during the agentic loop:
| Guardrail | Detection | Response | Threshold |
|---|---|---|---|
| Tool call limit | Counter | Block with error | > 3 tool calls per request |
| Iteration limit | Counter | Force end turn | > 5 agentic loop iterations |
| Request timeout | Timer | Graceful error | > 25 seconds (API client) |
| Lambda timeout | AWS | Hard termination | > 30 seconds |
| Fixed tool set | Allowlist | Error result | Unknown tool name |
Validate Claude's response before returning it to the client (a sketch follows the table):
| Guardrail | Detection | Response | Threshold |
|---|---|---|---|
| Valid category check | Enum validation | Block (raises error) | Category not in valid set |
| Valid confidence tier | Enum validation | Block (raises error) | Tier not in valid set |
| Reasoning truncation | Length check | Truncate with "..." | > 500 characters |
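The output stage reduces to enum validation plus truncation. A sketch, assuming the category and tier names shown earlier (the OVERSIZED spelling is an assumption):

```python
VALID_CATEGORIES = {"POUCH", "SMALL_BIN", "TOTE", "CARTON", "OVERSIZED"}
VALID_TIERS = {"HIGH", "MEDIUM_HIGH", "MEDIUM", "MEDIUM_LOW", "LOW"}
MAX_REASONING = 500

def validate_output(result: dict) -> dict:
    """Validate Claude's parsed response before it is returned to the client."""
    if result.get("classification") not in VALID_CATEGORIES:
        raise ValueError(f"Invalid category: {result.get('classification')}")      # -> 500
    if result.get("confidence") not in VALID_TIERS:
        raise ValueError(f"Invalid confidence tier: {result.get('confidence')}")   # -> 500
    if len(result.get("reasoning", "")) > MAX_REASONING:
        result["reasoning"] = result["reasoning"][:MAX_REASONING] + "..."
    return result
```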
Beyond runtime checks, the architecture provides inherent protections:
- Fixed tool set: The agent can only use 2 pre-defined tools—no arbitrary actions possible
- Single model: Hardcoded to Claude Haiku—no model escalation, predictable cost
- Lambda isolation: No network egress except to Anthropic API—no data exfiltration
- Structured output: JSON extraction with fallback parsing—consistent response format
WM2 includes a comprehensive local evaluation system for measuring classifier quality. The evaluation framework operates offline (batch analysis) to complement the runtime guardrails.
```mermaid
flowchart TB
subgraph Dataset["Test Dataset"]
CSV[wm2_eval_v1.csv<br/>60 labeled cases]
end
subgraph Runner["Evaluation Runner"]
LOAD[Load Dataset]
CLASSIFY[Run Classifier]
EVALUATE[Apply Evaluators]
AGGREGATE[Compute Metrics]
end
subgraph Evaluators["8 APF Evaluators"]
subgraph Gates["Layer 1: Gates"]
E1[valid_category]
E2[has_reasoning]
E3[valid_confidence_tier]
end
subgraph Pillars["Layer 2: Pillars"]
E4[fit_accuracy<br/>Effectiveness]
E5[strict_accuracy<br/>Effectiveness]
E6[latency_acceptable<br/>Efficiency]
E7[overconfident_failure<br/>Reliability]
E8[safety_weight_check<br/>Trustworthiness]
end
end
subgraph Storage["SQLite Storage"]
DB[(eval_results.db)]
SCHEMA[EvalRun + EvalResult<br/>+ EvaluatorResult]
end
subgraph Presentation["Visualization"]
CLI[CLI Commands]
DASH[Streamlit Dashboard]
end
CSV --> LOAD
LOAD --> CLASSIFY
CLASSIFY --> EVALUATE
EVALUATE --> Gates
EVALUATE --> Pillars
Gates --> AGGREGATE
Pillars --> AGGREGATE
AGGREGATE --> DB
DB --> CLI
DB --> DASH
```
The evaluation system follows the Agent Performance Framework (APF), with evaluators organized into two layers:
Layer 1 - Binary Gates (must-pass):
- `valid_category`: Output must be one of the 5 valid categories
- `has_reasoning`: Response must include non-empty reasoning
- `valid_confidence_tier`: Confidence must be a valid tier
Layer 2 - Pillar Metrics:
- Effectiveness: `fit_accuracy` (asymmetric; over-prediction is acceptable), `strict_accuracy` (exact match)
- Efficiency: `latency_acceptable` (< 5 seconds)
- Reliability: `overconfident_failure` (penalizes HIGH-confidence wrong answers)
- Trustworthiness: `safety_weight_check` (weight within category limits)
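As an illustration, two of the pillar evaluators could be written as below; the exact scoring rules in eval/ (for example, whether any over-prediction or only one size up counts as a fit) are assumptions:

```python
CATEGORY_ORDER = ("POUCH", "SMALL_BIN", "TOTE", "CARTON", "OVERSIZED")

def fit_accuracy(predicted: str, expected: str) -> float:
    """Asymmetric accuracy: an exact match or a larger-than-needed container counts as a fit."""
    if predicted == expected:
        return 1.0
    return 1.0 if CATEGORY_ORDER.index(predicted) > CATEGORY_ORDER.index(expected) else 0.0

def overconfident_failure(predicted: str, expected: str, confidence: str) -> bool:
    """Flags the worst failure mode: a HIGH-confidence answer that is wrong."""
    return confidence == "HIGH" and predicted != expected
```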
Every evaluation run captures:
- Git commit hash and branch name
- Timestamp
- Dataset name and version
- Per-case results with all evaluator scores
- Aggregate summary metrics
This enables questions like "What changed between these two runs?" and "Which commit introduced this regression?"
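Collecting those provenance fields is mostly a matter of asking git and the clock when the run starts; a sketch (function and field names are illustrative):

```python
import subprocess
from datetime import datetime, timezone

def run_metadata(dataset_name: str, dataset_version: str) -> dict:
    """Provenance fields stored alongside each evaluation run."""
    def git(*args):
        return subprocess.check_output(("git", *args), text=True).strip()

    return {
        "git_commit": git("rev-parse", "HEAD"),
        "git_branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset_name": dataset_name,
        "dataset_version": dataset_version,
    }
```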
```bash
# Run an evaluation
python -m eval.cli run --name "baseline-v1"
# View summary of latest run
python -m eval.cli summary
# List recent runs
python -m eval.cli list --last 10
# Compare two runs
python -m eval.cli compare <run-id-1> <run-id-2>
```

The dashboard provides interactive visualization of evaluation results:
```bash
streamlit run dashboard/app.py
```

Pages:
- Overview: Run summary with all evaluator scores and visual indicators
- Failures: Drill-down into failed cases with filtering by evaluator and category
- Compare: Side-by-side run comparison with improvement/regression highlighting
- History: Historical trends with interactive charts
Production traces are sent to Arize Phoenix for observability:
- Request/response tracing for all Claude API calls
- Latency metrics and token usage
- Tool call patterns and outcomes
Tracing is fail-open—if credentials are missing, requests proceed without observability.
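Fail-open here means tracing setup is wrapped so its absence or failure never blocks a classification. A sketch with a hypothetical configure_phoenix() helper standing in for the real setup code (the environment variable name is also an assumption):

```python
import logging
import os

logger = logging.getLogger(__name__)

def init_tracing():
    """Enable Phoenix tracing when credentials exist; never fail the request path."""
    if not os.environ.get("PHOENIX_API_KEY"):          # assumed credential env var
        logger.info("Phoenix credentials missing; continuing without tracing")
        return None
    try:
        return configure_phoenix()                     # hypothetical wrapper around the Phoenix SDK
    except Exception:
        logger.warning("Tracing setup failed; continuing without observability", exc_info=True)
        return None
```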
Clone the repository and install in development mode:
```bash
git clone https://github.com/EvieHwang/wm2.git
cd wm2
pip install -e .
```

This installs the project in editable mode, making all imports work correctly without PYTHONPATH manipulation.
For semantic search capabilities (local development):
pip install -e ".[semantic]"For development tools (testing, linting):
pip install -e ".[dev]"# Run all tests
python -m pytest
# Run only eval tests
python -m pytest eval/tests/
# Run only backend tests
python -m pytest backend/tests/
# Run with coverage
python -m pytest --cov=backend --cov=eval# Run an evaluation
python -m eval.cli run --name "baseline-v1"
# View summary of latest run
python -m eval.cli summary
# List recent runs
python -m eval.cli list --last 10
# Compare two runs
python -m eval.cli compare <run-id-1> <run-id-2>
```

Launch the dashboard:

```bash
streamlit run dashboard/app.py
# Opens browser to http://localhost:8501
```

POST /classify: Classify a product description into an ASRS container type.
Request:
```json
{
"description": "iPhone 15 Pro Max, 6.7 inch display, 221g"
}
```

Response:
```json
{
"classification": "POUCH",
"confidence": "HIGH",
"reasoning": "The iPhone 15 Pro Max is a smartphone with known dimensions approximately 6.3 x 3.0 x 0.3 inches and weight of 0.49 lbs, which fits comfortably within Pouch constraints.",
"tools_used": {
"lookup_known_product": {
"called": true,
"result": "Found iPhone 15 Pro Max in reference database"
},
"extract_explicit_dimensions": {
"called": false
}
},
"extracted_measurements": {
"length": 6.3,
"width": 3.0,
"height": 0.3,
"weight": 0.49,
"source": "reference"
},
"latency_ms": 1247
}
```

Submit feedback on a classification result.
Request:
```json
{
"description": "iPhone 15 Pro Max, 6.7 inch display, 221g",
"classification": "POUCH",
"is_correct": true
}
```

Response:
```json
{
"status": "success",
"message": "Feedback recorded"
}
```

Health check endpoint.
Response:
```json
{
"status": "healthy"
}
```

This project uses GitHub Actions for continuous integration and deployment:
- CI: Runs on all PRs and pushes to main (lint, test, security scan, dependency audit)
- Deploy: Automatically deploys to AWS on merge to main
| Secret | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | AWS IAM access key with Lambda, ECR, S3, CodeBuild permissions |
| `AWS_SECRET_ACCESS_KEY` | AWS IAM secret key |
| `ANTHROPIC_API_KEY` | Anthropic API key for running tests |
Prerequisites: AWS CLI configured, SAM CLI installed, Anthropic API key
```bash
# Backend
cd backend
sam build
sam deploy --guided --parameter-overrides AnthropicApiKey=your-key
# Frontend (update API_BASE_URL in app.js first)
aws s3 mb s3://your-bucket-name
aws s3 website s3://your-bucket-name --index-document index.html
aws s3 sync ../frontend s3://your-bucket-name
```

| Component | Technology |
|---|---|
| Runtime | Python 3.12, AWS Lambda |
| API | API Gateway (REST), CORS enabled |
| AI/ML | Claude API (Anthropic), sentence-transformers |
| Storage | DynamoDB (feedback), S3 (reference data, frontend) |
| Search | ChromaDB (vector embeddings), all-MiniLM-L6-v2 |
| Deployment | AWS SAM, GitHub Actions |
| Evaluation | SQLite, Streamlit, Arize Phoenix |
| Frontend | Vanilla HTML/CSS/JavaScript |
MIT License - See LICENSE for details.