An AI-powered product classification system that automatically assigns warehouse storage categories based on text descriptions. WM2 combines Claude's reasoning capabilities with tool-augmented retrieval to classify products into optimal ASRS (Automated Storage and Retrieval System) container types.
- What It Does
- Classification Categories
- Architecture
- Agentic Behavior
- Feedback Memory System
- Guardrails Pipeline
- Evaluation Framework
- Development Setup
- API Reference
- Deployment
- Tech Stack
WM2 solves the problem of automatically categorizing warehouse products into the right storage containers. Given a product description—anything from "USB-C cable, 6ft" to "Industrial hydraulic pump, 45 lbs, 18x12x14 inches"—the system determines which ASRS container type can safely and efficiently hold that product.
The classifier operates as an agentic system: rather than making a single inference call, it can autonomously decide to use tools to gather additional information before making its classification decision. This tool-augmented approach allows it to:
- Look up known products in a reference database of 479 items using semantic search
- Extract explicit dimensions from descriptions using regex pattern matching
- Apply constraint-based rules to select the smallest container that fits
Each classification includes a confidence tier and reasoning, making the system's decision-making transparent and auditable.
Products are classified into one of five container types, each with specific dimension and weight constraints:
| Category | Max Dimensions (L×W×H) | Max Weight | Typical Products |
|---|---|---|---|
| Pouch | 12" × 9" × 2" | 1 lb | Small electronics, cables, jewelry |
| Small Bin | 12" × 9" × 6" | 10 lbs | Books, small tools, packaged goods |
| Tote | 18" × 14" × 12" | 50 lbs | Appliances, bulk items, equipment |
| Carton | 24" × 18" × 18" | 70 lbs | Large equipment, furniture parts |
| Oversized | No limits | No limits | Items exceeding carton constraints |
The classifier always selects the smallest container that fits—a product that could fit in a Tote won't be assigned to a Carton. All three dimensions matter: a thin but long item might skip Small Bin entirely if its length exceeds the constraint.
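In code, the smallest-fits rule reduces to an ordered constraint check against the limits in the table above. The sketch below is illustrative only; the category names, dimension sorting, and data layout are assumptions, not the repository's actual implementation:

```python
# Container limits from the table above, ordered smallest to largest.
# Names and structure are illustrative, not the production data model.
CONTAINERS = [
    ("POUCH",     (12, 9, 2),   1),
    ("SMALL_BIN", (12, 9, 6),   10),
    ("TOTE",      (18, 14, 12), 50),
    ("CARTON",    (24, 18, 18), 70),
]

def smallest_fitting_container(dims_in, weight_lbs):
    """Return the smallest category whose limits contain the product.

    dims_in: (length, width, height) in inches. Both the product and the
    container limits are sorted so a thin-but-long item is checked against
    each container's longest side.
    """
    product = sorted(dims_in, reverse=True)
    for name, limits, max_weight in CONTAINERS:
        box = sorted(limits, reverse=True)
        if weight_lbs <= max_weight and all(p <= b for p, b in zip(product, box)):
            return name
    return "OVERSIZED"

# A 16" x 10" x 3", 4 lb item skips Pouch and Small Bin on length and lands in Tote.
assert smallest_fitting_container((16, 10, 3), 4) == "TOTE"
```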
Each classification includes a confidence tier indicating the system's certainty:
| Tier | Meaning | Typical Scenarios |
|---|---|---|
| HIGH | Strong conviction | Explicit dimensions provided, exact database match |
| MEDIUM_HIGH | Fairly confident | Good database match, clear product type |
| MEDIUM | Notable uncertainty | Estimated from product category |
| MEDIUM_LOW | Significant doubt | Vague description, ambiguous product type |
| LOW | Best guess | Very limited information available |
WM2 runs on AWS serverless infrastructure, with Claude handling the classification logic through an agentic loop:
```mermaid
flowchart TB
subgraph Client["Client"]
FE[S3 Static Site<br/>Frontend]
end
subgraph AWS["AWS Cloud"]
APIGW[API Gateway<br/>REST + CORS + Rate Limiting]
subgraph Lambda["Lambda Container"]
H[Handler<br/>Input Validation]
FB[Feedback Retrieval]
CL[Claude Agent<br/>Haiku Model]
subgraph Tools["Tool Execution"]
T1[lookup_known_product]
T2[extract_explicit_dimensions]
end
end
subgraph Storage["Storage Layer"]
DDB[(DynamoDB<br/>Feedback)]
S3D[(S3<br/>Reference Data)]
end
subgraph Search["Semantic Search"]
EMB[Embedding Service<br/>all-MiniLM-L6-v2]
CHR[(ChromaDB<br/>Vector Index)]
end
end
subgraph External["External"]
ANTH[Anthropic API]
ARIZE[Arize Phoenix<br/>Observability]
end
FE -->|POST /classify| APIGW
APIGW --> H
H --> FB
FB --> DDB
FB --> CL
CL <-->|Messages API| ANTH
CL -->|tool_use| T1
CL -->|tool_use| T2
T1 --> EMB
EMB --> CHR
T1 --> S3D
CHR -.->|Index from| S3D
CL -.->|Traces| ARIZE
H -->|Response| APIGW
APIGW --> FE
```
Request Flow (a handler-level sketch in Python follows the list):
- User submits a product description via the frontend
- API Gateway validates the request and applies rate limiting
- Lambda handler validates input (guardrails) and retrieves relevant feedback from DynamoDB
- Claude Agent receives the description with few-shot examples from feedback history
- Agent decides whether to invoke tools based on the input
- If tools are used, results are fed back to Claude for synthesis
- Claude returns a structured classification with confidence and reasoning
- Response passes through output guardrails before returning to client
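The same flow, reduced to a handler-level outline. This is a hedged sketch only: the helper names (retrieve_feedback, run_agent, validate_output) stand in for the modules described in later sections and do not reflect the repository's actual signatures.

```python
import json

def lambda_handler(event, context):
    """Illustrative outline of the request flow, not the actual handler."""
    # 1. Input guardrails: parse the JSON body and validate the description.
    try:
        body = json.loads(event.get("body") or "")
        description = body["description"].strip()
        if not (5 <= len(description) <= 2000):
            raise ValueError("description must be 5-2000 characters")
    except (json.JSONDecodeError, KeyError, ValueError, AttributeError) as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

    # 2. Retrieve stored feedback and format it as few-shot context.
    few_shot = retrieve_feedback(description)          # hypothetical helper

    # 3. Agentic loop: Claude decides whether to call tools, then answers.
    result = run_agent(description, few_shot)          # hypothetical helper

    # 4. Output guardrails, then return the structured classification.
    return {"statusCode": 200, "body": json.dumps(validate_output(result))}
```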
WM2's classifier operates as an autonomous agent that decides its own execution path. Rather than following a fixed pipeline, Claude dynamically chooses which tools to invoke (if any) based on the input description.
```mermaid
flowchart TD
START([Product Description]) --> INIT[Initialize Agent]
INIT --> FEEDBACK[Retrieve Feedback<br/>for Few-Shot Context]
FEEDBACK --> CLAUDE[Send to Claude<br/>with Tool Definitions]
CLAUDE --> DECISION{Claude Response}
DECISION -->|tool_use| TOOL_CHECK{Tool Call<br/>Limit OK?}
TOOL_CHECK -->|Yes| EXECUTE[Execute Tool]
TOOL_CHECK -->|No, > 3 calls| BLOCK[Block: Too Complex]
EXECUTE --> LOOKUP{Which Tool?}
LOOKUP -->|lookup_known_product| SEMANTIC[Semantic Search<br/>ChromaDB]
LOOKUP -->|extract_explicit_dimensions| REGEX[Regex Parsing<br/>Dimensions/Weight]
SEMANTIC --> RESULTS[Add Results<br/>to Messages]
REGEX --> RESULTS
RESULTS --> ITER_CHECK{Iteration < 5?}
ITER_CHECK -->|Yes| CLAUDE
ITER_CHECK -->|No| FORCE_END[Force End Turn]
DECISION -->|end_turn| PARSE[Parse JSON Response]
FORCE_END --> PARSE
PARSE --> VALIDATE[Output Guardrails<br/>Validate Category/Confidence]
VALIDATE --> RESPONSE([Classification Result])
BLOCK --> ERROR([Error Response])
```
The agent has access to exactly two tools:
Tool 1, `lookup_known_product`, searches a reference database of 479 known products using semantic similarity:
```python
# Tool definition (simplified)
{
"name": "lookup_known_product",
"description": "Search the reference database for known products matching the query",
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Product name or description to search for"
}
},
"required": ["query"]
}
}
```

How it works:
- The query is embedded using `all-MiniLM-L6-v2` (384-dimensional vectors)
- ChromaDB performs approximate nearest-neighbor search
- Results are re-ranked using hybrid scoring (semantic + keyword overlap)
- Top matches are returned with product name, dimensions, weight, and similarity score
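A minimal sketch of the re-ranking step, assuming ChromaDB results arrive as (name, similarity) pairs with similarity already normalized to 0..1; the 0.7/0.3 weighting and the helper names are assumptions, not the actual scoring code:

```python
def keyword_overlap(query: str, candidate: str) -> float:
    """Fraction of query tokens that also appear in the candidate product name."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank(query, chroma_results, semantic_weight=0.7):
    """Blend vector similarity with keyword overlap and sort best-first."""
    scored = [
        (name, semantic_weight * sim + (1 - semantic_weight) * keyword_overlap(query, name))
        for name, sim in chroma_results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```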
Tool 2, `extract_explicit_dimensions`, parses dimension and weight measurements from text using regex patterns:
```python
# Tool definition (simplified)
{
"name": "extract_explicit_dimensions",
"description": "Extract explicit dimensions (LxWxH) and weight from text",
"input_schema": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "Text containing dimension/weight information"
}
},
"required": ["text"]
}
}
```

Supported formats:
- Dimensions: `10x8x4`, `10"x8"x4"`, `10 x 8 x 4 inches`, `10cm × 8cm × 4cm`
- Weight: `5 lbs`, `2.5 kg`, `16 oz`, `500g`
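A simplified version of the parsing might look like the following; the repository's actual patterns and unit handling are likely more thorough (for example, this sketch does not convert centimeter dimensions):

```python
import re

# L x W x H with optional quotes/units, e.g. 10x8x4, 10"x8"x4", 10 x 8 x 4 inches
DIMS_RE = re.compile(
    r'(\d+(?:\.\d+)?)\s*(?:"|in(?:ches)?|cm)?\s*[x×]\s*'
    r'(\d+(?:\.\d+)?)\s*(?:"|in(?:ches)?|cm)?\s*[x×]\s*'
    r'(\d+(?:\.\d+)?)\s*(?:"|in(?:ches)?|cm)?',
    re.IGNORECASE,
)

# Weight with unit, e.g. 5 lbs, 2.5 kg, 16 oz, 500g
WEIGHT_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(lbs?|kg|oz|g)\b', re.IGNORECASE)

def extract_explicit_dimensions(text: str) -> dict:
    """Return explicit dimensions and weight (converted to pounds) found in text."""
    result = {}
    if m := DIMS_RE.search(text):
        result["dimensions"] = tuple(float(v) for v in m.groups())
    if m := WEIGHT_RE.search(text):
        value, unit = float(m.group(1)), m.group(2).lower()
        to_lbs = {"lb": 1, "lbs": 1, "kg": 2.20462, "oz": 1 / 16, "g": 0.00220462}
        result["weight_lbs"] = round(value * to_lbs[unit], 3)
    return result

print(extract_explicit_dimensions('Box, 10x8x4 inches, 5 lbs'))
# {'dimensions': (10.0, 8.0, 4.0), 'weight_lbs': 5.0}
```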
The agent autonomously decides whether to use tools based on context:
| Scenario | Typical Agent Behavior |
|---|---|
| "iPhone 15 Pro Max" | Calls lookup_known_product → likely finds exact match |
| "Box, 10x8x4 inches, 5 lbs" | Calls extract_explicit_dimensions → uses explicit measurements |
| "Small USB cable" | May call lookup_known_product, or classify directly as POUCH |
| "Industrial widget, model X-500" | Calls lookup_known_product → may find similar products |
| "24x18x16 shipping container, 45 lbs" | Calls extract_explicit_dimensions → CARTON based on dimensions |
Most classifications complete in 0-2 tool calls. The behavioral guardrails limit tool calls to 3 per request to prevent pathological looping.
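Putting the pieces together, the loop roughly follows the diagram above. This sketch uses the Anthropic Python SDK's Messages API; the model id, prompt assembly, and JSON extraction shown here are assumptions rather than verified details of the repository:

```python
import json
import anthropic

MAX_TOOL_CALLS = 3   # behavioral guardrail: block the request as too complex
MAX_ITERATIONS = 5   # behavioral guardrail: force an end of turn

def run_agent(description, system_prompt, tools, execute_tool):
    """Let Claude call tools until it ends its turn or a limit trips.

    `tools` is a list of tool definitions like the ones above;
    `execute_tool(name, args)` dispatches to the lookup/regex implementations.
    """
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": description}]
    tool_calls = 0

    for _ in range(MAX_ITERATIONS):
        response = client.messages.create(
            model="claude-3-5-haiku-latest",   # "Haiku"; exact model id is an assumption
            max_tokens=1024,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            break                              # end_turn: Claude has answered

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        tool_calls += len(tool_uses)
        if tool_calls > MAX_TOOL_CALLS:
            raise RuntimeError("Blocked: tool call limit exceeded")

        # Feed tool results back to Claude and iterate.
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": json.dumps(execute_tool(b.name, b.input))}
            for b in tool_uses
        ]})

    # Parse the final JSON classification from the last text block (fallback handling omitted).
    text = next((b.text for b in response.content if b.type == "text"), "{}")
    return json.loads(text)
```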
WM2 learns from user corrections through a feedback loop that stores thumbs up/down responses and retrieves them as few-shot examples for future classifications.
```mermaid
flowchart LR
subgraph Classification["Classification Request"]
DESC[Product Description]
KW[Extract Keywords]
end
subgraph Retrieval["Two-Tier Retrieval"]
RECENT[Recent Tier<br/>Last 10 entries]
KEYWORD[Keyword Tier<br/>Semantic relevance]
end
subgraph Context["Prompt Context"]
DEDUP[Deduplicate]
FORMAT[Format as<br/>Few-Shot Examples]
end
subgraph Storage["DynamoDB"]
DDB[(Feedback Table)]
end
DESC --> KW
KW --> KEYWORD
DDB --> RECENT
DDB --> KEYWORD
RECENT --> DEDUP
KEYWORD --> DEDUP
DEDUP --> FORMAT
FORMAT --> CLAUDE[Claude Agent]
CLAUDE -->|Classification| USER[User]
USER -->|"👍 / 👎"| STORE[Store Feedback]
STORE --> DDB
```
Storage (on feedback submission):
- User submits thumbs up (correct) or thumbs down (incorrect) for a classification
- System extracts keywords from the product description
- Feedback is stored in DynamoDB with: description, classification, correctness, keywords, timestamp
Retrieval (on classification request):
- Recency tier: Fetches the 10 most recent feedback entries (captures recent corrections)
- Keyword tier: Finds entries with overlapping keywords (captures domain-relevant examples)
- Results are deduplicated and formatted as few-shot examples in the system prompt
Prompt injection format:
```
## User Feedback History
The following are previous classifications with user feedback:
✓ CORRECT: "iPhone 15 Pro case" → POUCH
✗ INCORRECT: "Large toolbox set" → SMALL_BIN (user indicated this was wrong)
✓ CORRECT: "Hydraulic pump, 12x10x8, 25 lbs" → TOTE
```
This approach allows the model to learn from corrections without retraining, adapting its behavior based on accumulated feedback.
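A hedged sketch of the retrieval-and-formatting step, assuming a boto3 DynamoDB table with description, classification, is_correct, keywords, and timestamp attributes; the table name, scan-based access, and keyword matching are assumptions rather than the repository's actual query patterns:

```python
import boto3

def retrieve_feedback(description, table_name="wm2-feedback", recent_n=10):
    """Fetch recent + keyword-relevant feedback and format it as few-shot text."""
    table = boto3.resource("dynamodb").Table(table_name)   # table name is hypothetical
    items = table.scan().get("Items", [])                  # acceptable for a small table

    # Tier 1: the most recent entries, regardless of topic.
    recent = sorted(items, key=lambda i: i["timestamp"], reverse=True)[:recent_n]

    # Tier 2: entries whose stored keywords overlap with the new description.
    query_words = set(description.lower().split())
    relevant = [i for i in items if query_words & set(i.get("keywords", []))]

    # Deduplicate (recency tier wins) and format as few-shot lines.
    seen, lines = set(), []
    for item in recent + relevant:
        if item["description"] in seen:
            continue
        seen.add(item["description"])
        mark = "✓ CORRECT" if item["is_correct"] else "✗ INCORRECT"
        lines.append(f'{mark}: "{item["description"]}" → {item["classification"]}')
    return "## User Feedback History\n" + "\n".join(lines)
```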
WM2 implements a three-stage guardrails architecture following the GUARD (Generalized Unified Agent Risk Defense) framework. Guardrails operate at runtime to prevent harm and waste, complementing the offline evaluation system that measures quality.
```mermaid
flowchart LR
subgraph Input["Input Stage"]
I1[Length Check<br/>5-2000 chars]
I2[Empty/Whitespace<br/>Validation]
I3[JSON Parse<br/>Validation]
I4[Rate Limiting<br/>30 req/sec]
end
subgraph Behavioral["Behavioral Stage"]
B1[Tool Call Limit<br/>Max 3 calls]
B2[Iteration Limit<br/>Max 5 loops]
B3[Request Timeout<br/>25 seconds]
B4[Fixed Tool Set<br/>2 tools only]
end
subgraph Output["Output Stage"]
O1[Category Validation<br/>Must be valid]
O2[Confidence Validation<br/>Must be valid tier]
O3[Reasoning Truncation<br/>Max 500 chars]
end
REQUEST([Request]) --> Input
Input -->|Pass| Behavioral
Input -->|Fail| BLOCK1([400 Error])
Behavioral -->|Pass| CLAUDE[Claude Agent]
Behavioral -->|Fail| BLOCK2([Error Response])
CLAUDE --> Output
Output -->|Pass| RESPONSE([Classification])
Output -->|Fail| BLOCK3([500 Error])
```
Protect the system before Claude sees the request (a validation sketch follows the table):
| Guardrail | Detection | Response | Threshold |
|---|---|---|---|
| Minimum description length | Deterministic | Block (400) | < 5 characters |
| Maximum description length | Deterministic | Block (400) | > 2000 characters |
| Empty/whitespace check | Deterministic | Block (400) | Empty or whitespace-only |
| JSON parse validation | Deterministic | Block (400) | Invalid JSON body |
| API Gateway rate limiting | Deterministic | Block (429) | > 30 req/sec, 50 burst |
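Rate limiting is enforced at API Gateway; the remaining input checks run inside the Lambda handler. A minimal sketch (the length bounds match the table; error messages and structure are illustrative):

```python
import json

MIN_LEN, MAX_LEN = 5, 2000

def validate_request(raw_body: str) -> str:
    """Apply input guardrails; return the cleaned description or raise ValueError (-> 400)."""
    try:
        body = json.loads(raw_body or "")
    except json.JSONDecodeError:
        raise ValueError("Request body is not valid JSON")

    description = str(body.get("description", "")).strip()
    if not description:
        raise ValueError("Description is empty or whitespace-only")
    if not (MIN_LEN <= len(description) <= MAX_LEN):
        raise ValueError(f"Description must be {MIN_LEN}-{MAX_LEN} characters")
    return description
```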
Prevent runaway execution during the agentic loop:
| Guardrail | Detection | Response | Threshold |
|---|---|---|---|
| Tool call limit | Counter | Block with error | > 3 tool calls per request |
| Iteration limit | Counter | Force end turn | > 5 agentic loop iterations |
| Request timeout | Timer | Graceful error | > 25 seconds (API client) |
| Lambda timeout | AWS | Hard termination | > 30 seconds |
| Fixed tool set | Allowlist | Error result | Unknown tool name |
Validate Claude's response before returning it to the client (a sketch follows the table):
| Guardrail | Detection | Response | Threshold |
|---|---|---|---|
| Valid category check | Enum validation | Block (raises error) | Category not in valid set |
| Valid confidence tier | Enum validation | Block (raises error) | Tier not in valid set |
| Reasoning truncation | Length check | Truncate with "..." | > 500 characters |
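The output stage reduces to enum validation plus truncation. A sketch, assuming the category and tier names shown earlier (the OVERSIZED spelling is an assumption):

```python
VALID_CATEGORIES = {"POUCH", "SMALL_BIN", "TOTE", "CARTON", "OVERSIZED"}
VALID_TIERS = {"HIGH", "MEDIUM_HIGH", "MEDIUM", "MEDIUM_LOW", "LOW"}
MAX_REASONING = 500

def validate_output(result: dict) -> dict:
    """Validate Claude's parsed response before it is returned to the client."""
    if result.get("classification") not in VALID_CATEGORIES:
        raise ValueError(f"Invalid category: {result.get('classification')}")      # -> 500
    if result.get("confidence") not in VALID_TIERS:
        raise ValueError(f"Invalid confidence tier: {result.get('confidence')}")   # -> 500
    if len(result.get("reasoning", "")) > MAX_REASONING:
        result["reasoning"] = result["reasoning"][:MAX_REASONING] + "..."
    return result
```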
Beyond runtime checks, the architecture provides inherent protections:
- Fixed tool set: The agent can only use 2 pre-defined tools—no arbitrary actions possible
- Single model: Hardcoded to Claude Haiku—no model escalation, predictable cost
- Lambda isolation: No network egress except to Anthropic API—no data exfiltration
- Structured output: JSON extraction with fallback parsing—consistent response format
WM2 includes a comprehensive local evaluation system for measuring classifier quality. The evaluation framework operates offline (batch analysis) to complement the runtime guardrails.
```mermaid
flowchart TB
subgraph Dataset["Test Dataset"]
CSV[wm2_eval_v1.csv<br/>60 labeled cases]
end
subgraph Runner["Evaluation Runner"]
LOAD[Load Dataset]
CLASSIFY[Run Classifier]
EVALUATE[Apply Evaluators]
AGGREGATE[Compute Metrics]
end
subgraph Evaluators["8 APF Evaluators"]
subgraph Gates["Layer 1: Gates"]
E1[valid_category]
E2[has_reasoning]
E3[valid_confidence_tier]
end
subgraph Pillars["Layer 2: Pillars"]
E4[fit_accuracy<br/>Effectiveness]
E5[strict_accuracy<br/>Effectiveness]
E6[latency_acceptable<br/>Efficiency]
E7[overconfident_failure<br/>Reliability]
E8[safety_weight_check<br/>Trustworthiness]
end
end
subgraph Storage["SQLite Storage"]
DB[(eval_results.db)]
SCHEMA[EvalRun + EvalResult<br/>+ EvaluatorResult]
end
subgraph Presentation["Visualization"]
CLI[CLI Commands]
DASH[Streamlit Dashboard]
end
CSV --> LOAD
LOAD --> CLASSIFY
CLASSIFY --> EVALUATE
EVALUATE --> Gates
EVALUATE --> Pillars
Gates --> AGGREGATE
Pillars --> AGGREGATE
AGGREGATE --> DB
DB --> CLI
DB --> DASH
```
The evaluation system follows the Agent Performance Framework (APF), with evaluators organized into two layers:
Layer 1 - Binary Gates (must-pass):
- `valid_category`: Output must be one of the 5 valid categories
- `has_reasoning`: Response must include non-empty reasoning
- `valid_confidence_tier`: Confidence must be a valid tier
Layer 2 - Pillar Metrics:
- Effectiveness: `fit_accuracy` (asymmetric; over-prediction is acceptable), `strict_accuracy` (exact match)
- Efficiency: `latency_acceptable` (< 5 seconds)
- Reliability: `overconfident_failure` (penalizes HIGH-confidence wrong answers)
- Trustworthiness: `safety_weight_check` (weight within category limits)
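As an illustration, two of the pillar evaluators could be written as below; the exact scoring rules in eval/ (for example, whether any over-prediction or only one size up counts as a fit) are assumptions:

```python
CATEGORY_ORDER = ("POUCH", "SMALL_BIN", "TOTE", "CARTON", "OVERSIZED")

def fit_accuracy(predicted: str, expected: str) -> float:
    """Asymmetric accuracy: an exact match or a larger-than-needed container counts as a fit."""
    if predicted == expected:
        return 1.0
    return 1.0 if CATEGORY_ORDER.index(predicted) > CATEGORY_ORDER.index(expected) else 0.0

def overconfident_failure(predicted: str, expected: str, confidence: str) -> bool:
    """Flags the worst failure mode: a HIGH-confidence answer that is wrong."""
    return confidence == "HIGH" and predicted != expected
```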
Every evaluation run captures:
- Git commit hash and branch name
- Timestamp
- Dataset name and version
- Per-case results with all evaluator scores
- Aggregate summary metrics
This enables questions like "What changed between these two runs?" and "Which commit introduced this regression?"
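Collecting those provenance fields is mostly a matter of asking git and the clock when the run starts; a sketch (function and field names are illustrative):

```python
import subprocess
from datetime import datetime, timezone

def run_metadata(dataset_name: str, dataset_version: str) -> dict:
    """Provenance fields stored alongside each evaluation run."""
    def git(*args):
        return subprocess.check_output(("git", *args), text=True).strip()

    return {
        "git_commit": git("rev-parse", "HEAD"),
        "git_branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset_name": dataset_name,
        "dataset_version": dataset_version,
    }
```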
```bash
# Run an evaluation
python -m eval.cli run --name "baseline-v1"
# View summary of latest run
python -m eval.cli summary
# List recent runs
python -m eval.cli list --last 10
# Compare two runs
python -m eval.cli compare <run-id-1> <run-id-2>
```

The dashboard provides interactive visualization of evaluation results:
```bash
streamlit run dashboard/app.py
```

Pages:
- Overview: Run summary with all evaluator scores and visual indicators
- Failures: Drill-down into failed cases with filtering by evaluator and category
- Compare: Side-by-side run comparison with improvement/regression highlighting
- History: Historical trends with interactive charts
Production traces are sent to Arize Phoenix for observability:
- Request/response tracing for all Claude API calls
- Latency metrics and token usage
- Tool call patterns and outcomes
Tracing is fail-open—if credentials are missing, requests proceed without observability.
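Fail-open here means tracing setup is wrapped so its absence or failure never blocks a classification. A sketch with a hypothetical configure_phoenix() helper standing in for the real setup code (the environment variable name is also an assumption):

```python
import logging
import os

logger = logging.getLogger(__name__)

def init_tracing():
    """Enable Phoenix tracing when credentials exist; never fail the request path."""
    if not os.environ.get("PHOENIX_API_KEY"):          # assumed credential env var
        logger.info("Phoenix credentials missing; continuing without tracing")
        return None
    try:
        return configure_phoenix()                     # hypothetical wrapper around the Phoenix SDK
    except Exception:
        logger.warning("Tracing setup failed; continuing without observability", exc_info=True)
        return None
```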
Clone the repository and install in development mode:
```bash
git clone https://github.com/EvieHwang/wm2.git
cd wm2
pip install -e .
```

This installs the project in editable mode, making all imports work correctly without PYTHONPATH manipulation.
For semantic search capabilities (local development):
pip install -e ".[semantic]"For development tools (testing, linting):
pip install -e ".[dev]"# Run all tests
python -m pytest
# Run only eval tests
python -m pytest eval/tests/
# Run only backend tests
python -m pytest backend/tests/
# Run with coverage
python -m pytest --cov=backend --cov=eval# Run an evaluation
python -m eval.cli run --name "baseline-v1"
# View summary of latest run
python -m eval.cli summary
# List recent runs
python -m eval.cli list --last 10
# Compare two runs
python -m eval.cli compare <run-id-1> <run-id-2>
```

Launch the dashboard:

```bash
streamlit run dashboard/app.py
# Opens browser to http://localhost:8501
```

POST /classify: Classify a product description into an ASRS container type.
Request:
```json
{
"description": "iPhone 15 Pro Max, 6.7 inch display, 221g"
}
```

Response:
```json
{
"classification": "POUCH",
"confidence": "HIGH",
"reasoning": "The iPhone 15 Pro Max is a smartphone with known dimensions approximately 6.3 x 3.0 x 0.3 inches and weight of 0.49 lbs, which fits comfortably within Pouch constraints.",
"tools_used": {
"lookup_known_product": {
"called": true,
"result": "Found iPhone 15 Pro Max in reference database"
},
"extract_explicit_dimensions": {
"called": false
}
},
"extracted_measurements": {
"length": 6.3,
"width": 3.0,
"height": 0.3,
"weight": 0.49,
"source": "reference"
},
"latency_ms": 1247
}
```

Submit feedback on a classification result.
Request:
```json
{
"description": "iPhone 15 Pro Max, 6.7 inch display, 221g",
"classification": "POUCH",
"is_correct": true
}
```

Response:
```json
{
"status": "success",
"message": "Feedback recorded"
}
```

Health check endpoint.
Response:
```json
{
"status": "healthy"
}
```

This project uses GitHub Actions for continuous integration and deployment:
- CI: Runs on all PRs and pushes to main (lint, test, security scan, dependency audit)
- Deploy: Automatically deploys to AWS on merge to main
| Secret | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | AWS IAM access key with Lambda, ECR, S3, CodeBuild permissions |
| `AWS_SECRET_ACCESS_KEY` | AWS IAM secret key |
| `ANTHROPIC_API_KEY` | Anthropic API key for running tests |
Prerequisites: AWS CLI configured, SAM CLI installed, Anthropic API key
```bash
# Backend
cd backend
sam build
sam deploy --guided --parameter-overrides AnthropicApiKey=your-key
# Frontend (update API_BASE_URL in app.js first)
aws s3 mb s3://your-bucket-name
aws s3 website s3://your-bucket-name --index-document index.html
aws s3 sync ../frontend s3://your-bucket-name
```

| Component | Technology |
|---|---|
| Runtime | Python 3.12, AWS Lambda |
| API | API Gateway (REST), CORS enabled |
| AI/ML | Claude API (Anthropic), sentence-transformers |
| Storage | DynamoDB (feedback), S3 (reference data, frontend) |
| Search | ChromaDB (vector embeddings), all-MiniLM-L6-v2 |
| Deployment | AWS SAM, GitHub Actions |
| Evaluation | SQLite, Streamlit, Arize Phoenix |
| Frontend | Vanilla HTML/CSS/JavaScript |
MIT License - See LICENSE for details.