AI moderation system for dating apps that doesn't flag "Want to grab coffee?" as harassment.
Most moderation tools treat dating apps like Twitter - zero tolerance, immediate bans. I built something that understands dating conversations are different.
Key features:
- Progressive warnings instead of instant bans
- "You're beautiful" scored as appropriate, not harassment
- Crisis intervention for self-harm (support, not punishment)
- Dual system: gentle handling for normal chat, escalation for real threats
- Interactive Streamlit demo with professional UI
- Robust error handling for AI safety filters and edge cases
- Consistent output parsing with graceful fallbacks
Dating apps lose users when moderation is too aggressive: over-moderation kills engagement.
The business problem:
- False positives frustrate users into leaving
- Support tickets flood in from wrongly banned users
- Appeal processes waste time and money
Dating app context matters:
- "You're hot" between matched users isn't harassment
- Phone number requests are normal once a conversation has developed
- Hook-up language should score 2-3, not 8-9
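To make that contrast concrete, the rubric can be expressed as data rather than scattered rules. A minimal sketch, assuming a 0-10 severity scale; category names and exact numbers are illustrative, not lifted from the engine's prompts:

```python
# Hypothetical severity rubric on a 0-10 scale; values illustrate the intended
# calibration for dating app context, not the engine's exact numbers.
SEVERITY_RUBRIC = {
    "compliment_between_matches": (0, 1),  # "You're beautiful" -> appropriate
    "hookup_language": (2, 3),             # forward, but normal on a dating app
    "contact_info_request": (1, 3),        # normal once a conversation has developed
    "persistent_after_rejection": (5, 7),  # progressive-warning territory
    "fraud_or_scam": (8, 10),              # immediate escalation
    "hate_speech": (9, 10),
}

# Score bands map to progressive enforcement rather than instant bans.
ACTION_BANDS = {
    range(0, 4): "no_action",
    range(4, 7): "warning",
    range(7, 11): "escalate",
}
```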
Built two different prompts that route automatically:
Normal conversations → Gentle scoring with progressive enforcement
Serious issues → Immediate escalation (hate speech, self-harm, fraud)
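A minimal sketch of that routing step, assuming a keyword pre-filter in front of two prompt templates; the prompt text, marker list, and function names are illustrative, not copied from hinge_moderation_v2.py:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Keyword pre-filter for routing; the real engine's marker list is broader.
ESCALATION_MARKERS = ["kill myself", "suicide", "wire me", "crypto", "send money"]

GENTLE_PROMPT = "You moderate a dating app. Score 0-10, tolerant of flirtation..."
ESCALATION_PROMPT = "You handle serious safety issues: hate speech, self-harm, fraud..."

def route(message: str) -> str:
    """Pick the prompt template: gentle scoring vs. immediate escalation."""
    lowered = message.lower()
    if any(marker in lowered for marker in ESCALATION_MARKERS):
        return ESCALATION_PROMPT
    return GENTLE_PROMPT

def moderate(message: str) -> str:
    """Run one message through the routed prompt and return the raw verdict."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": route(message)},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content
```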
Evaluation process:
- Tested on 45+ real dating app messages
- Manual scoring to find false positive patterns
- Langfuse tracking for every decision
- Systematic prompt improvements based on failures
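Roughly how each decision gets tracked, assuming the Langfuse v2 Python SDK with keys configured via environment variables (helper names are illustrative):

```python
from langfuse import Langfuse  # assumes the v2 SDK; keys come from env vars

langfuse = Langfuse()

def run_eval(labeled_messages: list[tuple[str, int]]) -> None:
    """Score every test message and log the decision for later failure analysis."""
    for text, expected_severity in labeled_messages:
        trace = langfuse.trace(name="moderation-eval", input=text)
        verdict = moderate(text)  # engine call from the routing sketch above
        trace.update(output=verdict)
        # Manual label recorded alongside the trace to surface false-positive patterns.
        trace.score(name="expected_severity", value=expected_severity)
    langfuse.flush()
```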
Crisis handling: Self-harm detection doesn't remove content - it provides mental health resources and notifies appropriate support teams.
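In code, "support, not punishment" is just a branch that never reaches the removal path. A sketch with placeholder resource text and handler names:

```python
CRISIS_RESOURCES = (
    "You're not alone. If you're struggling, support is available: "
    "call or text 988 (US Suicide & Crisis Lifeline)."
)

def notify_safety_team(message: str) -> None:
    # Placeholder: in production this would page the trust & safety queue.
    print(f"[safety-team] review requested: {message!r}")

def apply_verdict(message: str, category: str, severity: int) -> dict:
    """Self-harm signals get resources and a team notification, never a ban."""
    if category == "self_harm":
        notify_safety_team(message)
        return {"action": "keep", "user_message": CRISIS_RESOURCES}
    if severity >= 7:
        return {"action": "escalate", "user_message": None}
    if severity >= 4:
        return {"action": "warn", "user_message": "Please keep things respectful."}
    return {"action": "keep", "user_message": None}
```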
Professional Streamlit interface with:
- Clean two-column layout with proper spacing
- Quick test buttons for common scenarios (Hate Speech, Self-Harm, Fraud)
- Loading states and success feedback
- Expandable technical analysis view
- Mobile-friendly responsive design
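A condensed sketch of how web_demo.py wires this together (widget layout simplified; example texts elided):

```python
import streamlit as st

st.title("Dating App Moderation Demo")

# Quick test buttons for common scenarios, wired through session state.
examples = {
    "Hate Speech": "...",
    "Self-Harm": "...",
    "Fraud": "...",
    "Normal Chat": "Want to grab coffee?",
}
cols = st.columns(len(examples))
for col, (label, text) in zip(cols, examples.items()):
    if col.button(label):
        st.session_state["message"] = text

left, right = st.columns(2)
with left:
    message = st.text_area("Message to moderate", st.session_state.get("message", ""))
    if st.button("Analyze") and message:
        with st.spinner("Scoring..."):
            st.session_state["verdict"] = moderate(message)  # engine from earlier sketch
        st.success("Done")
with right:
    if "verdict" in st.session_state:
        st.subheader("Verdict")
        st.write(st.session_state["verdict"])
        with st.expander("Technical analysis"):
            st.json({"raw_output": st.session_state["verdict"]})
```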
Quick tests available:
- Hate speech detection and scoring
- Self-harm crisis intervention
- Fraud/scam identification
- Normal dating conversation handling
Visual examples in the demo:
- Boundary-Pushing Content Analysis
- Crisis Intervention for Self-Harm

Requirements:
- Python 3.8+
- OpenAI API key
- Langfuse account (for tracking)
Project files:
- hinge_moderation_v2.py - Main moderation engine
- web_demo.py - Streamlit interface
- hinge-terms-of-use.txt - Reference guidelines
Recent improvements:
- Fixed AI output consistency issues - Resolved parsing errors and format inconsistencies
- Optimized token usage - Reduced specialized prompts from 10,177 to ~744 tokens (93% reduction)
- Enhanced safety filter handling - Graceful responses when OpenAI safety systems trigger (see the sketch after this list)
- Improved hate speech detection - Enhanced keyword detection for more accurate routing
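The graceful path for safety-filter triggers looks roughly like this (exception handling per the openai v1 SDK; the refusal heuristic is illustrative):

```python
import openai

def safe_moderate(message: str) -> dict:
    """Wrap the engine so provider safety filters degrade gracefully."""
    try:
        verdict = moderate(message)  # engine call from the routing sketch
    except openai.BadRequestError:
        # The request itself was refused; escalate rather than crash or auto-ban.
        return {"action": "escalate", "reason": "provider_safety_filter"}
    if not verdict or verdict.lstrip().lower().startswith(("i can't", "i cannot")):
        # Heuristic refusal check; the real engine's detection is more thorough.
        return {"action": "escalate", "reason": "model_refusal"}
    return {"action": "score", "raw": verdict}
```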
Outcomes:
- Reduced false positives in severity scoring for dating app contexts
- Progressive enforcement maintains safety while improving user experience
- Crisis intervention provides support rather than punishment for self-harm content
Technical implementation:
- GPT-4 dual-prompt system for routing and analysis
- Optimized prompt engineering with token limit management
- Robust safety filter detection for OpenAI content policy triggers
- Langfuse integration for observability and improvement tracking
- Streamlit frontend with professional UI/UX and visual examples
- Consistent output parsing with improved format handling (sketched below)
- Session state management for interactive testing
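A sketch of the parsing fallback chain: try strict JSON first, then scrape a score out of prose, and only then hand off to a human (thresholds and field names are illustrative):

```python
import json
import re

def parse_verdict(raw: str) -> dict:
    """Parse model output into {severity, category}, falling back gracefully."""
    try:
        data = json.loads(raw)  # happy path: the model returned clean JSON
        return {"severity": int(data["severity"]),
                "category": data.get("category", "other")}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass
    # Fallback: scrape a number out of free-form prose like "Severity: 7".
    match = re.search(r"severity\D{0,10}(\d{1,2})", raw, re.IGNORECASE)
    if match:
        return {"severity": min(int(match.group(1)), 10), "category": "unparsed"}
    # Last resort: flag for human review instead of guessing.
    return {"severity": -1, "category": "needs_review"}
```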
Planned RAG enhancements:
- Policy Knowledge Base: Vector database of dating app guidelines and precedents
- Context-Aware Decisions: Retrieve relevant policy examples for consistent enforcement
- Appeal Case History: Learn from previous moderation decisions and outcomes
- Dynamic Policy Updates: Automatically incorporate new guidelines without code changes
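Since this is roadmap rather than shipped code, here is only a sketch of the planned retrieval step, using chromadb as one possible backend (collection name and helper are hypothetical):

```python
import chromadb

client = chromadb.Client()
policies = client.get_or_create_collection("dating_app_policies")

def policy_context(message: str, k: int = 3) -> str:
    """Fetch the k most relevant guideline snippets to ground the moderation prompt."""
    results = policies.query(query_texts=[message], n_results=k)
    return "\n".join(results["documents"][0])

# The retrieved snippets would be prepended to the moderation prompt so
# decisions cite actual policy text instead of relying on the model's memory.
```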
Planned structured outputs:
- JSON Schema Validation: Guaranteed consistent API responses for production integration
- Typed Moderation Results: Structured data for downstream systems and analytics
- Audit Trail Format: Standardized logging for compliance and review processes
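A sketch of what the typed result could look like with Pydantic (field names and categories are illustrative):

```python
from typing import Literal

from pydantic import BaseModel, Field

class ModerationResult(BaseModel):
    """Typed verdict for downstream systems, audit logs, and analytics."""
    severity: int = Field(ge=0, le=10)
    category: Literal["normal", "harassment", "hate_speech", "self_harm", "fraud"]
    action: Literal["keep", "warn", "escalate"]
    rationale: str

# Validation guarantees a consistent shape before anything reaches production.
result = ModerationResult(
    severity=2, category="normal", action="keep",
    rationale="Flirtatious but appropriate between matched users.",
)
print(result.model_dump_json())
```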
Built for AI Product Manager interviews - demonstrates a systematic approach to trust & safety, a user-experience focus, and technical implementation skills.
Stack: Python, GPT-4, Streamlit, Langfuse


