Expand classification prompt for improved accuracy #10
- Expand few-shot examples from 3 to 12, covering all categories
- Add missing examples for cards and credits (their absence was causing 50% accuracy on cards)
- Add comprehensive classification rules for distinguishing:
  * Product-specific categories (when a product name is mentioned)
  * Generic "Частные клиенты" ("Private Clients") categories (when no product name is mentioned)
- Improve category presentation with logical grouping and inline hints
- Add special rules for deposits, cards, credits, and other categories

Expected impact:
- Overall accuracy: 90% → 95%+
- Cards category: 50% → 90%+
- Better handling of edge cases

Changes:
- src/classification/prompt_builder.py:
  * FEW_SHOT_EXAMPLES: 4 examples → 12 examples with full coverage
  * _build_system_prompt(): enhanced with A/B rules and product-specific guidance
  * _format_categories(): improved with logical ordering and category descriptions
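The commit attributes the 50% accuracy on cards to having no card examples in the prompt. A check that every category has at least one few-shot example would catch that kind of gap early. This is a minimal sketch, assuming each example is an `(inquiry, category)` pair; the example texts, the category list, and the `missing_categories` helper are hypothetical, and the real `FEW_SHOT_EXAMPLES` structure may differ.

```python
# Hypothetical example shape: (inquiry, category). The real FEW_SHOT_EXAMPLES
# in src/classification/prompt_builder.py may use a different structure.
FEW_SHOT_EXAMPLES = [
    ("What is the rate on the СуперСемь deposit?", "Deposits"),
    ("How do I get a ЧЕРЕПАХА installment card?", "Cards"),
    ("How do I change my phone number?", "Частные клиенты"),
]

# Hypothetical category list; in practice it would come from the FAQ database.
CATEGORIES = ["Deposits", "Cards", "Credits", "Частные клиенты"]


def missing_categories(examples, categories):
    """Return the categories (in their original order) that have no example."""
    covered = {category for _, category in examples}
    return [c for c in categories if c not in covered]


print(missing_categories(FEW_SHOT_EXAMPLES, CATEGORIES))  # ['Credits']
```

Run as a unit test, an empty result would mean every category is represented at least once.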
- Use all 35+ subcategory examples in the prompt instead of just 5
- Add a comprehensive keyword map and disambiguation rules
- Increase the frontend classification timeout from 15s to 30s
- Remove question-mark auto-append in inquiry sanitization

This improves classification accuracy for rare/specific products like СуперСемь and ЧЕРЕПАХА that weren't covered in the original 5 examples. The timeout increase accommodates the larger prompt size (~900 tokens).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
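The last bullet removes the automatic trailing "?" from inquiry sanitization. The sketch below contrasts the assumed old and new behavior so callers that relied on the punctuation can be pinned down in a test. Only the removal of the "?" append comes from the PR; the whitespace normalization and the `sanitize_inquiry_before` / `sanitize_inquiry_after` names are hypothetical stand-ins for `sanitize_inquiry()` in `src/utils/validation.py`.

```python
import re


def sanitize_inquiry_before(text: str) -> str:
    """Assumed old behavior: normalize whitespace, then force a trailing '?'."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    if cleaned and not cleaned.endswith("?"):
        cleaned += "?"
    return cleaned


def sanitize_inquiry_after(text: str) -> str:
    """Assumed new behavior: normalize whitespace only; punctuation stays as typed."""
    return re.sub(r"\s+", " ", text).strip()


raw = "  I want to  close my card "
print(sanitize_inquiry_before(raw))  # I want to close my card?
print(sanitize_inquiry_after(raw))   # I want to close my card
```

Any downstream code that checks for a trailing "?" would now see inputs like the second one, so searching callers for such checks is one way to confirm nothing depended on the old guarantee.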
Pull Request Review: Expand classification prompt for improved accuracy

Summary
This PR significantly expands the few-shot learning prompt from 5 examples to 35+ examples, adds comprehensive keyword mapping, and includes detailed disambiguation rules. The changes aim to improve classification accuracy at the cost of increased latency and token usage.

Code Quality & Best Practices ✅
Strengths
Potential Issues 🔍

1. Memory and Performance Concerns (Medium Priority)
Issue: The prompt is now quite large (35+ examples plus keyword maps and rules). While you mention ~900 additional tokens, the actual impact could be higher.
Location: src/classification/prompt_builder.py:437

2. Inconsistent Subcategory Name Format (Low Priority)
Issue: Some subcategories use different naming conventions. Ensure consistency with the actual FAQ database schema.
Location: src/classification/prompt_builder.py:171-189

3. Removed Input Sanitization Logic (Medium-High Priority)
Issue: The PR removes automatic question-mark appending from sanitize_inquiry(). This is a good change for more natural input, but verify that no other code depends on this punctuation guarantee.
Location: src/utils/validation.py:93

4. Potential Test Failures (High Priority)
Issue: The test file tests/unit/test_prompt_builder.py:48 looks for specific text that may no longer be in the new FEW_SHOT_EXAMPLES.
Recommendation: Update the tests to match the new examples.
Location: tests/unit/test_prompt_builder.py:48

Performance Considerations ⚡
Expected Impact
Recommendations
Test Coverage 📊
Current Test Issues
Recommendations
Final Recommendation
✅ APPROVE with minor changes

This PR delivers on its promise: improved accuracy through comprehensive few-shot learning. The trade-offs are acceptable for hackathon evaluation, where classification accuracy is worth 30 points.

Required Before Merge
Great work on the comprehensive prompt engineering! The structured approach with keywords and disambiguation rules shows deep understanding of the classification challenges. 🚀
Update test_prompt_builder.py to match new prompt format:
- Replace old example check with generic format indicator
- Update category format test for new indentation style
- All 8 tests now pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
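Checking for structural markers instead of one specific example sentence keeps tests stable as few-shot examples are added or reworded. A minimal sketch of that testing style follows; `build_prompt(categories, examples)`, the `Inquiry:` / `Category:` markers, and the two-space category indentation are hypothetical stand-ins, not the actual output of `prompt_builder.py`.

```python
# Hypothetical stand-in for the real builder; the actual function in
# src/classification/prompt_builder.py and its output format may differ.
def build_prompt(categories, examples):
    lines = ["Categories:"]
    lines += [f"  - {name}" for name in categories]
    lines.append("Examples:")
    lines += [f"Inquiry: {q}\nCategory: {c}" for q, c in examples]
    return "\n".join(lines)


def test_prompt_contains_every_category_and_example_marker():
    categories = ["Deposits", "Cards"]
    examples = [("What is the rate on СуперСемь?", "Deposits")]
    prompt = build_prompt(categories, examples)
    # Assert on format markers, not on a specific example sentence,
    # so editing FEW_SHOT_EXAMPLES does not break this test.
    assert "Inquiry:" in prompt and "Category:" in prompt
    # Assert on the indentation style used for category lines.
    for name in categories:
        assert f"  - {name}" in prompt


test_prompt_contains_every_category_and_example_marker()
```

Under pytest the `test_`-prefixed function is collected automatically; the direct call at the bottom only lets the sketch run standalone.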
Pull Request Review: Expand classification prompt for improved accuracy

Overview
This PR significantly expands the classification prompt from 5 examples to 35+ subcategory examples, adding comprehensive keyword maps and disambiguation rules to improve classification accuracy. The changes are well-aligned with the hackathon's evaluation criteria (30 points for classification quality).

✅ Strengths
1. Comprehensive Few-Shot Learning
2. Disambiguation Rules
3. Test Coverage
4. Sensible Timeout Adjustment
Summary
This PR significantly improves classification accuracy by expanding the few-shot prompt from 5 examples to all 35+ subcategory examples, along with comprehensive keyword maps and disambiguation rules.
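One way a keyword map can be turned into prompt text is a short hint section with one line per product keyword. This is a minimal sketch, assuming a flat keyword-to-category mapping; the two entries mirror the products named under Test Results, while `KEYWORD_MAP`, `format_keyword_section`, and the "Keyword hints:" wording are hypothetical.

```python
# Hypothetical keyword map: product keyword -> category label. The entries
# mirror products named in this PR; the real map and its prompt format in
# src/classification/prompt_builder.py may differ.
KEYWORD_MAP = {
    "ЧЕРЕПАХА": "Installment Cards",
    "СуперСемь": "Deposits",
}


def format_keyword_section(keyword_map: dict[str, str]) -> str:
    """Render the map as a prompt section, one hint line per keyword.

    Keywords are sorted so the prompt text is identical across runs.
    """
    lines = ["Keyword hints:"]
    for keyword in sorted(keyword_map):
        lines.append(f"- '{keyword}' -> {keyword_map[keyword]}")
    return "\n".join(lines)


print(format_keyword_section(KEYWORD_MAP))
```

Sorting makes the output deterministic (here the СуперСемь line comes before the ЧЕРЕПАХА line), which keeps prompt diffs and format-based tests stable when entries are added.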
Changes
1. Classification Prompt Expansion (src/classification/prompt_builder.py)
2. Frontend Timeout Increase (frontend/src/services/api.ts)
3. Input Sanitization (src/utils/validation.py)

Test Results
Successfully classified products that were NOT in the original 5 examples:
СуперСемь (Deposits)
ЧЕРЕПАХА (Installment Cards)
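The first commit states its goals per category (overall 90% → 95%+, cards 50% → 90%+), while the results above are two spot-checked products. Re-measuring those figures requires per-category accuracy over a labeled set. This is a minimal sketch, assuming results are available as `(expected, predicted)` category pairs; `per_category_accuracy` and the toy labels are hypothetical, not the PR's evaluation data.

```python
from collections import defaultdict


def per_category_accuracy(pairs):
    """pairs: iterable of (expected_category, predicted_category).

    Returns {category: accuracy}, grouped by the expected category.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, predicted in pairs:
        totals[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}


# Toy labels for illustration only.
results = [
    ("Cards", "Cards"),
    ("Cards", "Частные клиенты"),
    ("Deposits", "Deposits"),
    ("Deposits", "Deposits"),
]
print(per_category_accuracy(results))  # {'Cards': 0.5, 'Deposits': 1.0}
```

Grouping by the expected label shows which categories lose accuracy, which is how a gap like the cards figure would surface.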
Impact
Pros:
Cons:
Performance Trade-off
The accuracy improvement justifies the performance cost for the hackathon evaluation, where classification accuracy is worth 30 points total.
🤖 Generated with Claude Code