
Expand classification prompt for improved accuracy #10

Merged
pandarun merged 3 commits into main from feature/improve-classification-prompt
Oct 16, 2025

Conversation

@pandarun
Owner

Summary

This PR significantly improves classification accuracy by expanding the few-shot prompt from 5 examples to all 35+ subcategory examples, along with comprehensive keyword maps and disambiguation rules.

Changes

1. Classification Prompt Expansion (src/classification/prompt_builder.py)

  • ✅ Include all 35+ subcategory examples instead of just 5
  • ✅ Add comprehensive keyword map with trigger phrases for each subcategory
  • ✅ Add detailed disambiguation guidance for close variants
  • ✅ Improve matching algorithm with explicit prioritization
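The reviews below describe a `SUBCATEGORY_GUIDE` OrderedDict from which `FEW_SHOT_EXAMPLES` is generated dynamically. A minimal sketch of that shape, with illustrative field names and only two entries (the real guide in src/classification/prompt_builder.py covers all 35+ subcategories and may use a different schema):

```python
from collections import OrderedDict

# Illustrative sketch; field names and entries are assumptions,
# not the exact schema in src/classification/prompt_builder.py.
SUBCATEGORY_GUIDE = OrderedDict([
    ("Рублевые - СуперСемь", {
        "example_inquiry": "Какая ставка по вкладу СуперСемь?",
        "keywords": ["СуперСемь", "рублевый вклад"],
    }),
    ("Карты рассрочки - ЧЕРЕПАХА", {
        "example_inquiry": "Как оформить карту рассрочки ЧЕРЕПАХА?",
        "keywords": ["ЧЕРЕПАХА", "рассрочка"],
    }),
])

# Generating the few-shot examples from the guide guarantees every
# subcategory appears exactly once and eliminates duplication.
FEW_SHOT_EXAMPLES = [
    {"inquiry": info["example_inquiry"], "subcategory": name}
    for name, info in SUBCATEGORY_GUIDE.items()
]
```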

2. Frontend Timeout Increase (frontend/src/services/api.ts)

  • ✅ Increase classification timeout from 15s to 30s
  • ✅ Accommodate larger prompt size (~900 additional tokens)

3. Input Sanitization (src/utils/validation.py)

  • ✅ Remove automatic question mark appending
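A minimal sketch of the behavior change in `sanitize_inquiry` (the actual implementation in src/utils/validation.py may differ; the function body here is an assumption):

```python
import re

def sanitize_inquiry(text: str) -> str:
    """Illustrative sketch of the sanitizer after this PR: whitespace is
    normalized, but no question mark is appended, so the inquiry reaches
    the model exactly as the user punctuated it."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Before this PR the function ended with something like:
    #   if not cleaned.endswith("?"): cleaned += "?"
    # That auto-append was removed.
    return cleaned
```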

Test Results

Successfully classified products that were NOT in the original 5 examples:

СуперСемь (Deposits)

  • Category: Продукты - Вклады ✓
  • Subcategory: Рублевые - СуперСемь ✓
  • Confidence: 96%

ЧЕРЕПАХА (Installment Cards)

  • Category: Продукты - Карты ✓
  • Subcategory: Карты рассрочки - ЧЕРЕПАХА ✓
  • Confidence: 97%

Impact

Pros:

  • Significantly better accuracy for rare/specific product subcategories
  • Every subcategory now has in-context representation
  • Should improve validation scores (10 points per correct classification)

Cons:

  • Response times: 3-10 seconds (was 2-5 seconds)
  • Higher token usage per classification
  • More susceptible to API rate limits during heavy usage

Performance Trade-off

The accuracy improvement justifies the performance cost for the hackathon evaluation, where classification accuracy is worth 30 points total.

🤖 Generated with Claude Code

schernykh and others added 2 commits October 15, 2025 23:08
- Expand few-shot examples from 3 to 12 covering all categories
- Add missing examples for cards and credits (was causing 50% accuracy on cards)
- Add comprehensive classification rules for distinguishing:
  * Product-specific categories (when product name mentioned)
  * Generic "Частные клиенты" categories (when no product name)
- Improve category presentation with logical grouping and inline hints
- Add special rules for deposits, cards, credits, and other categories

Expected impact:
- Overall accuracy: 90% → 95%+
- Cards category: 50% → 90%+
- Better handling of edge cases

Changes:
- src/classification/prompt_builder.py:
  * FEW_SHOT_EXAMPLES: 4 examples → 12 examples with full coverage
  * _build_system_prompt(): Enhanced with A/B rules and product-specific guidance
  * _format_categories(): Improved with logical ordering and category descriptions
- Use all 35+ subcategory examples in prompt instead of just 5
- Add comprehensive keyword map and disambiguation rules
- Increase frontend classification timeout from 15s to 30s
- Remove question mark auto-append in inquiry sanitization

This improves classification accuracy for rare/specific products like
СуперСемь and ЧЕРЕПАХА that weren't covered in the original 5 examples.
The timeout increase accommodates the larger prompt size (~900 tokens).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@claude

claude bot commented Oct 16, 2025

Pull Request Review: Expand classification prompt for improved accuracy

Summary

This PR significantly expands the few-shot learning prompt from 5 examples to 35+ examples, adds comprehensive keyword mapping, and includes detailed disambiguation rules. The changes aim to improve classification accuracy at the cost of increased latency and token usage.

Code Quality & Best Practices ✅

Strengths

  1. Well-Structured Data Organization

    • The SUBCATEGORY_GUIDE using OrderedDict provides excellent maintainability
    • Clear separation of concerns: examples, keywords, and classification logic
    • Good use of constants for confidence defaults
  2. Comprehensive Documentation

    • Russian-language prompt instructions are detailed and thorough
    • Clear disambiguation rules help the LLM distinguish between similar categories
    • The hierarchical matching algorithm (exact match → FAQ template → rules) is logical
  3. Good Code Practices

    • Dynamic generation of FEW_SHOT_EXAMPLES from SUBCATEGORY_GUIDE eliminates duplication
    • Proper use of type hints throughout
    • Methods follow single responsibility principle
  4. Thoughtful Timeout Adjustment

  • Increasing the frontend timeout from 15s to 30s is appropriate given the larger prompt
    • Well-documented in code comments

Potential Issues 🔍

1. Memory and Performance Concerns (Medium Priority)

Issue: The prompt is now quite large (~35+ examples + keyword maps + rules). While you mention ~900 additional tokens, the actual impact could be higher.

Location: src/classification/prompt_builder.py:437

2. Inconsistent Subcategory Name Format (Low Priority)

Issue: Some subcategories use different naming conventions. Ensure consistency with the actual FAQ database schema.

Location: src/classification/prompt_builder.py:171-189

3. Removed Input Sanitization Logic (Medium-High Priority)

Issue: The PR removes automatic question mark appending from sanitize_inquiry(). This is a good change for more natural input, but verify no other code depends on this punctuation guarantee.

Location: src/utils/validation.py:93

4. Potential Test Failures (High Priority)

Issue: The test file tests/unit/test_prompt_builder.py:48 looks for specific text that may no longer be in the new FEW_SHOT_EXAMPLES.

Recommendation: Update tests to match new examples.

Location: tests/unit/test_prompt_builder.py:48

Performance Considerations ⚡

Expected Impact

  • ✅ Response time increase: 2-5s → 3-10s (acceptable for hackathon)
  • ✅ Higher token usage per request (~900 additional input tokens)
  • ⚠️ Potential rate limiting issues under load

Recommendations

  1. Add performance monitoring for prompt size
  2. Consider prompt caching if Scibox API supports it
  3. Load test with concurrent requests
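Recommendation 1 above could be sketched as a rough prompt-size logger. The token estimate here is a crude heuristic and the function names are hypothetical, not part of the codebase:

```python
import logging

logger = logging.getLogger("classification")

def estimate_tokens(prompt: str) -> int:
    # Crude heuristic (~4 characters per token): good enough for trend
    # monitoring, not for billing. Swap in the model's real tokenizer
    # if the Scibox API exposes one.
    return max(1, len(prompt) // 4)

def log_prompt_size(prompt: str, warn_above: int = 2000) -> int:
    tokens = estimate_tokens(prompt)
    if tokens > warn_above:
        logger.warning("Prompt size %d tokens exceeds budget %d", tokens, warn_above)
    else:
        logger.info("Prompt size: %d tokens", tokens)
    return tokens
```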

Test Coverage 📊

Current Test Issues

  1. ❌ Test likely to fail: test_few_shot_examples_in_prompt (line 68-76)
  2. ❌ Test likely to fail: test_system_prompt_structure (line 48)
  3. ✅ Other structural tests should pass

Recommendations

  1. Update unit tests to match new examples
  2. Add integration tests for new subcategories (СуперСемь, ЧЕРЕПАХА)
  3. Add performance regression tests

Final Recommendation

✅ APPROVE with minor changes

This PR delivers on its promise: improved accuracy through comprehensive few-shot learning. The trade-offs are acceptable for hackathon evaluation where classification accuracy is worth 30 points.

Required Before Merge

  1. ✅ Update unit tests to match new examples
  2. ✅ Verify no test failures in CI/CD
  3. ⚠️ Consider adding prompt size monitoring

Great work on the comprehensive prompt engineering! The structured approach with keywords and disambiguation rules shows deep understanding of the classification challenges. 🚀

Update test_prompt_builder.py to match new prompt format:
- Replace old example check with generic format indicator
- Update category format test for new indentation style
- All 8 tests now pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@claude

claude bot commented Oct 16, 2025

Pull Request Review: Expand classification prompt for improved accuracy

Overview

This PR significantly expands the classification prompt from 5 examples to 35+ subcategory examples, adding comprehensive keyword maps and disambiguation rules to improve classification accuracy. The changes are well-aligned with the hackathon's evaluation criteria (30 points for classification quality).


✅ Strengths

1. Comprehensive Few-Shot Learning

  • Excellent: All 35+ subcategories now have representative examples, eliminating coverage gaps
  • The SUBCATEGORY_GUIDE OrderedDict provides structured mapping with examples and keywords for each subcategory
  • This should significantly improve accuracy on rare/specific products like "СуперСемь" and "ЧЕРЕПАХА"

2. Disambiguation Rules

  • The expanded prompt includes detailed disambiguation guidance for close variants (lines 306-313 in prompt_builder.py)
  • Clear rules distinguish between product-specific vs. general inquiries (e.g., "Продукты - Вклады" vs. "Частные клиенты - Вклады и депозиты")
  • Special handling for валютные/рублевые deposits and онлайн variants

3. Test Coverage

  • Unit tests updated appropriately for the new prompt structure (test_prompt_builder.py:47-48)
  • Tests verify all examples are included in the prompt

4. Sensible Timeout Adjustment

  • Frontend timeout increased from 15s to 30s (api.ts:32) to accommodate larger prompt
  • Good documentation of the reasoning

⚠️ Issues & Concerns

1. Performance Impact (Medium Priority)

Issue: Response times increased from 2-5s to 3-10s, with potential for higher token costs and API rate limiting during heavy usage.

Recommendation:

  • Monitor actual classification latency in production/demo
  • Consider implementing request caching for common inquiries
  • If performance becomes problematic, consider a hybrid approach:
    • Use lighter 10-15 example prompt for common categories
    • Fall back to full prompt only for edge cases or low-confidence results
# Example hybrid approach (illustrative helper names)
result = classify_with_light_prompt(inquiry)  # 10-15 example prompt
if result.confidence < 0.85:
    # Retry with the full expanded prompt
    result = classify_with_expanded_prompt(inquiry)

2. Validation Removal May Reduce Consistency (Low-Medium Priority)

Issue: Removed automatic question mark appending from sanitize_inquiry() (validation.py:93-98)

Concern: LLM responses can be sensitive to punctuation. Without consistent formatting, the model might perform differently on:

  • "Какая процентная ставка" (no punctuation)
  • "Какая процентная ставка?" (with question mark)

Recommendation:

  • Run A/B tests on validation dataset to confirm this doesn't degrade accuracy
  • If accuracy drops, consider restoring the normalization but make it configurable
  • Document the reasoning for this change in code comments

3. Hard-Coded Confidence Values (Low Priority)

Issue: CATEGORY_CONFIDENCE_DEFAULTS uses hard-coded confidence values (lines 25-32)

Concern: These defaults are used in few-shot examples, but they don't reflect actual model behavior. This could:

  • Mislead the LLM about what constitutes high/low confidence
  • Create anchoring bias

Recommendation:

# Better approach: derive from actual model performance
CATEGORY_CONFIDENCE_DEFAULTS = {
    "Новые клиенты": 0.93,  # Based on validation set performance
    "Техническая поддержка": 0.95,  # 95% accuracy on test set
    # Add comments explaining where these values come from
}

4. Prompt Size and Token Budget (Medium Priority)

Issue: The expanded prompt is now ~900 additional tokens (~1,200-1,500 tokens total estimated)

Concerns:

  • Cost scaling: For high-volume deployments, this significantly increases API costs
  • Context window: While manageable for Qwen2.5-72B, leaves less room for longer customer inquiries
  • Latency: More tokens = longer processing time

Recommendations:

  • Add prompt token counting to metrics/logging
  • Consider prompt compression techniques:
    • Remove verbose examples, keep only keywords
    • Use abbreviations for repeated phrases
    • Template common patterns

Example optimization:

# Before (verbose)
"example_inquiry": "Как стать клиентом банка онлайн?",
"keywords": ["регистрация", "новый клиент", "МСИ", "идентификация"]

# After (compressed)
"kw": ["регистрация", "новый клиент", "МСИ", "идентификация"]
# Remove example if keywords are sufficient

5. Missing Integration Tests (Medium Priority)

Issue: No integration tests verify the expanded prompt actually improves accuracy on validation dataset

Recommendation:
Add integration test that runs classification on the 3 validation questions mentioned in PR description:

# tests/integration/test_expanded_prompt_accuracy.py
def test_validation_dataset_accuracy():
    """Verify the expanded prompt improves accuracy on validation cases."""
    test_cases = [
        ("СуперСемь", "Продукты - Вклады", "Рублевые - СуперСемь"),
        ("ЧЕРЕПАХА", "Продукты - Карты", "Карты рассрочки - ЧЕРЕПАХА"),
        # Add all validation cases
    ]
    for inquiry, expected_cat, expected_subcat in test_cases:
        result = classifier.classify(inquiry)  # classifier from a shared fixture
        assert result.category == expected_cat
        assert result.subcategory == expected_subcat

6. Keyword Case Sensitivity (Low Priority)

Issue: Keywords are lowercase but Russian text commonly uses capitalization

Example: "ЧЕРЕПАХА" vs "черепаха" in keywords

Recommendation: Document that keyword matching should be case-insensitive, or normalize keywords to match expected input
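The case-insensitivity recommendation could be implemented as a small helper. A sketch under the assumption that keyword matching is a simple substring check (the helper name is hypothetical):

```python
def matches_keyword(inquiry: str, keywords: list[str]) -> bool:
    """Case-insensitive keyword check, so 'ЧЕРЕПАХА' and 'черепаха'
    both match. casefold() handles Cyrillic case mapping correctly,
    unlike naive ASCII-only lowercasing."""
    folded = inquiry.casefold()
    return any(kw.casefold() in folded for kw in keywords)
```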


🔒 Security Considerations

No security concerns identified

  • No credential handling changes
  • No SQL injection vectors introduced
  • Input sanitization still in place (though modified)

🧪 Testing Recommendations

  1. Run validation dataset through expanded prompt

    • Measure accuracy improvement vs. baseline
    • Document results in PR or separate test report
  2. Performance benchmarks

    • Measure actual latency distribution (p50, p95, p99)
    • Compare token usage before/after
    • Test under load (concurrent requests)
  3. Edge case testing

    • Test with malformed inputs (no punctuation, all caps, mixed languages)
    • Test with inquiries that don't match any category well
    • Test boundary cases between similar subcategories
  4. Regression testing

    • Ensure previous correct classifications still work
    • Run full test suite to catch unintended side effects

📝 Code Quality

Positive aspects:

  • ✅ Clear code structure with good separation of concerns
  • ✅ Comprehensive docstrings
  • ✅ Consistent formatting and naming conventions
  • ✅ Good use of OrderedDict for maintaining category order

Minor improvements:

  • Consider extracting the massive prompt string to a separate template file for maintainability
  • Add more inline comments explaining the disambiguation logic
  • Consider adding type hints to _format_keyword_map() return type

🎯 Final Recommendation

APPROVE with suggestions

This PR makes meaningful improvements to classification accuracy by providing comprehensive coverage of all subcategories. The approach is sound for a hackathon context where accuracy is heavily weighted in scoring.

Before merging:

  1. ✅ Run validation dataset tests to confirm accuracy improvements
  2. ⚠️ Monitor performance metrics and consider optimizations if needed
  3. ⚠️ Add integration tests for validation cases
  4. 💡 Consider documenting the prompt engineering decisions for future reference

Trade-off justification: The performance cost (2-5s → 3-10s) is acceptable for a hackathon demo where classification accuracy is worth 30 points. For production deployment, consider the optimization strategies mentioned above.


Great work on systematically improving the classification system! The structured approach with keywords and disambiguation rules shows good prompt engineering practices. 🚀

Generated by Claude Code

@pandarun pandarun merged commit 7c66794 into main Oct 16, 2025
1 check passed
@pandarun pandarun deleted the feature/improve-classification-prompt branch October 16, 2025 12:11