Summary
Implement intelligent, dynamic failover for model routing that automatically switches primary providers when specific models experience errors, while maintaining Tinfoil as the preferred primary when healthy.
Current Behavior
- Tinfoil is hardcoded as primary for all models it supports
- Failover to Continuum only happens per-request after timeout (90+ seconds)
- Temporary hardcoded swap for `llama3-3-70b` to use Continuum as primary (see src/proxy_config.rs:342-366)
Desired Behavior
- Default Priority: Tinfoil should remain primary for all models it supports when healthy
- Health Detection: Monitor model-specific health per provider
- Automatic Failover: When a specific model on Tinfoil fails, automatically promote Continuum as primary for that model only
- Recovery: Periodically check if Tinfoil has recovered and restore it as primary
- Granular Control: Track health per model, not per provider (e.g., `llama3-3-70b` might be down while `deepseek-r1-0528` works fine)
Proposed Implementation
1. Health Tracking System
```rust
struct ModelHealth {
    provider: String,
    model_id: String,
    consecutive_failures: u32,
    last_failure: Option<DateTime<Utc>>,
    last_success: Option<DateTime<Utc>>,
    is_healthy: bool,
}
```
2. Dynamic Route Adjustment
- Track failures in `src/web/openai.rs` when handling requests
- Update model routes in `ProxyRouter` based on health status
- Threshold-based switching (e.g., 3 consecutive failures = unhealthy)
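The threshold logic above might look like this. A minimal sketch only: the identifying fields from the section 1 struct are dropped (keys would live in the surrounding map), the `chrono` timestamps are simplified to `std::time::Instant` to keep the example self-contained, and the thresholds are passed in rather than read from config:

```rust
use std::time::Instant;

/// Per-model health record (timestamps simplified to `Instant` in this sketch).
struct ModelHealth {
    consecutive_failures: u32,
    consecutive_successes: u32,
    last_failure: Option<Instant>,
    last_success: Option<Instant>,
    is_healthy: bool,
}

impl ModelHealth {
    fn new() -> Self {
        ModelHealth {
            consecutive_failures: 0,
            consecutive_successes: 0,
            last_failure: None,
            last_success: None,
            is_healthy: true,
        }
    }

    /// Record a failed request; mark unhealthy after `failure_threshold`
    /// consecutive failures (proposed default: 3).
    fn record_failure(&mut self, failure_threshold: u32) {
        self.consecutive_failures += 1;
        self.consecutive_successes = 0;
        self.last_failure = Some(Instant::now());
        if self.consecutive_failures >= failure_threshold {
            self.is_healthy = false;
        }
    }

    /// Record a successful request; an unhealthy model recovers only after
    /// `recovery_threshold` consecutive successes (proposed default: 2).
    fn record_success(&mut self, recovery_threshold: u32) {
        self.consecutive_failures = 0;
        self.last_success = Some(Instant::now());
        if !self.is_healthy {
            self.consecutive_successes += 1;
            if self.consecutive_successes >= recovery_threshold {
                self.is_healthy = true;
                self.consecutive_successes = 0;
            }
        }
    }
}
```

Requiring several consecutive successes before restoring Tinfoil (rather than flipping back on the first success) avoids flapping between providers when Tinfoil is only intermittently up.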
3. Health Check Strategies
Option A: Passive Monitoring
- Track actual request failures/successes
- No additional traffic, but slower to detect recovery
- Could implement exponential backoff for retry attempts
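The exponential backoff mentioned above can be a pure function of the failure count. A sketch; the 5-second base and 300-second cap are illustrative values, not from this proposal:

```rust
/// Delay before the next retry after `consecutive_failures` failures:
/// base * 2^(n-1), capped. Base and cap are illustrative defaults.
fn backoff_secs(consecutive_failures: u32, base: u64, cap: u64) -> u64 {
    if consecutive_failures == 0 {
        return 0;
    }
    let shift = (consecutive_failures - 1).min(16); // clamp to avoid overflow
    base.saturating_mul(1u64 << shift).min(cap)
}
```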
Option B: Active Health Checks
- Periodic lightweight test requests to each model
- Faster recovery detection
- Additional traffic/cost considerations
Option C: Hybrid Approach
- Passive monitoring for failure detection
- Active health checks only for models marked unhealthy
- Balance between responsiveness and efficiency
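Under Option C, the active check loop would only probe models already marked unhealthy, and only once the recovery interval has elapsed. A sketch, assuming a `recovery_check_interval` as in the proposed configuration; `ProbeState` and `should_probe` are illustrative names, not existing project code:

```rust
use std::time::{Duration, Instant};

struct ProbeState {
    is_healthy: bool,
    last_probe: Option<Instant>,
}

/// A model qualifies for an active health probe only if it is unhealthy
/// and at least `interval` has passed since the previous probe.
fn should_probe(state: &ProbeState, now: Instant, interval: Duration) -> bool {
    if state.is_healthy {
        return false; // healthy models are covered by passive monitoring
    }
    match state.last_probe {
        None => true,
        Some(t) => now.duration_since(t) >= interval,
    }
}
```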
4. Integration Points
- Modify `ProxyRouter::refresh_cache()` to consider health status
- Add health tracking to request handlers in `src/web/openai.rs`
- Store health state in memory with optional persistence
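Because the health state is read on every routed request and written concurrently by handlers, the in-memory store could sit behind an `Arc<RwLock<...>>` keyed by (provider, model). A sketch; the `HealthMap` alias and helper names are illustrative, and the per-model state is reduced to a bare `bool` for brevity:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Shared in-memory health state, keyed by (provider, model_id).
type HealthMap = Arc<RwLock<HashMap<(String, String), bool>>>;

fn mark_unhealthy(map: &HealthMap, provider: &str, model: &str) {
    map.write().unwrap()
        .insert((provider.to_string(), model.to_string()), false);
}

/// Routing reads default to healthy when no record exists yet,
/// so healthy models pay only a read-lock on the hot path.
fn is_healthy(map: &HealthMap, provider: &str, model: &str) -> bool {
    map.read().unwrap()
        .get(&(provider.to_string(), model.to_string()))
        .copied()
        .unwrap_or(true)
}
```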
5. Configuration
```toml
[model_health]
failure_threshold = 3          # Consecutive failures before marking unhealthy
recovery_check_interval = 300  # Seconds between health checks for failed models
recovery_threshold = 2         # Consecutive successes before marking healthy
```
Implementation Notes
- Should handle both connection errors (502, timeouts) and model-specific errors differently
- Consider implementing circuit breaker pattern for more sophisticated failure handling
- Log all provider switches for debugging/monitoring
- Ensure thread-safe access to health state
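The circuit breaker mentioned above maps naturally onto the standard three-state pattern. A sketch of that pattern (not existing project code); thresholds and cooldowns would come from the `[model_health]` config:

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Clone, Copy)]
enum Circuit {
    Closed,   // requests flow to Tinfoil normally
    Open,     // requests routed to Continuum until the cooldown expires
    HalfOpen, // a trial request is allowed through to test recovery
}

struct Breaker {
    state: Circuit,
    failures: u32,
    threshold: u32,
    opened_at: Option<Instant>,
    cooldown: Duration,
}

impl Breaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Breaker { state: Circuit::Closed, failures: 0, threshold, opened_at: None, cooldown }
    }

    /// A failure during HalfOpen reopens immediately; otherwise open at the threshold.
    fn on_failure(&mut self, now: Instant) {
        self.failures += 1;
        if self.state == Circuit::HalfOpen || self.failures >= self.threshold {
            self.state = Circuit::Open;
            self.opened_at = Some(now);
        }
    }

    fn on_success(&mut self) {
        self.failures = 0;
        self.state = Circuit::Closed;
        self.opened_at = None;
    }

    /// Called before routing: moves Open -> HalfOpen once the cooldown has elapsed.
    fn poll(&mut self, now: Instant) -> Circuit {
        if self.state == Circuit::Open {
            if let Some(t) = self.opened_at {
                if now.duration_since(t) >= self.cooldown {
                    self.state = Circuit::HalfOpen;
                }
            }
        }
        self.state
    }
}
```

This generalizes the plain failure counter: Open answers "route to Continuum", and HalfOpen gives Tinfoil a single probe request instead of a full traffic switch on the first sign of recovery.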
Acceptance Criteria
- Models automatically failover to Continuum when Tinfoil has issues
- Failed models automatically recover when Tinfoil is healthy again
- No manual intervention required for provider switching
- Health status visible in logs/metrics
- Configurable thresholds and intervals
- No performance degradation for healthy models
Related Code
- src/proxy_config.rs: Current routing logic
- src/web/openai.rs: Request handling and fallback logic
- Current temporary fix at src/proxy_config.rs:342-366
Priority
High - This directly impacts user experience when Tinfoil has intermittent issues