Implement dynamic model routing based on provider health monitoring #87

@AnthonyRonning

Summary

Implement intelligent, dynamic failover for model routing that automatically switches primary providers when specific models experience errors, while maintaining Tinfoil as the preferred primary when healthy.

Current Behavior

  • Tinfoil is hardcoded as primary for all models it supports
  • Failover to Continuum only happens per-request after timeout (90+ seconds)
  • Temporary hardcoded swap for llama3-3-70b to use Continuum as primary (see src/proxy_config.rs:342-366)

Desired Behavior

  1. Default Priority: Tinfoil should remain primary for all models it supports when healthy
  2. Health Detection: Monitor model-specific health per provider
  3. Automatic Failover: When a specific model on Tinfoil fails, automatically promote Continuum as primary for that model only
  4. Recovery: Periodically check if Tinfoil has recovered and restore it as primary
  5. Granular Control: Track health per model, not per provider (e.g., llama3-3-70b might be down while deepseek-r1-0528 works fine)
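A rough sketch of the per-model priority rule described above. The `Provider` enum, the `provider_order` function, and the `unhealthy` set are hypothetical names for illustration, not existing code in the repo, and the sketch assumes both providers serve the model in question.

```rust
use std::collections::HashSet;

// Hypothetical enum; the real code may identify providers differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Provider {
    Tinfoil,
    Continuum,
}

/// Returns the providers to try for `model_id`, primary first.
/// Tinfoil stays primary unless that specific (provider, model) pair
/// is currently marked unhealthy.
fn provider_order(model_id: &str, unhealthy: &HashSet<(Provider, String)>) -> [Provider; 2] {
    let tinfoil_down = unhealthy.contains(&(Provider::Tinfoil, model_id.to_string()));
    if tinfoil_down {
        [Provider::Continuum, Provider::Tinfoil]
    } else {
        [Provider::Tinfoil, Provider::Continuum]
    }
}
```

With only `llama3-3-70b` marked unhealthy on Tinfoil, that model would route to Continuum first while `deepseek-r1-0528` keeps Tinfoil as primary.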

Proposed Implementation

1. Health Tracking System

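use chrono::{DateTime, Utc}; // assumes the chrono crate, as implied by DateTime<Utc>

// Health of a single model on a single provider
#[derive(Debug, Clone)]
struct ModelHealth {
    provider: String,                     // e.g. Tinfoil or Continuum
    model_id: String,                     // e.g. llama3-3-70b
    consecutive_failures: u32,            // reset to 0 on any success
    last_failure: Option<DateTime<Utc>>,  // None if never failed
    last_success: Option<DateTime<Utc>>,  // None if never succeeded
    is_healthy: bool,                     // false once failures reach the threshold
}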

2. Dynamic Route Adjustment

  • Track failures in src/web/openai.rs when handling requests
  • Update model routes in ProxyRouter based on health status
  • Threshold-based switching (e.g., 3 consecutive failures = unhealthy)
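The threshold-based switching could look roughly like the following. This is a minimal sketch, not a proposal for the final types: `HealthTracker` and its methods are hypothetical, and the `consecutive_successes` counter is an addition needed to support the `recovery_threshold` setting from the Configuration section.

```rust
// Hypothetical per-(provider, model) tracker; timestamps omitted for brevity.
#[derive(Debug)]
struct HealthTracker {
    failure_threshold: u32,
    recovery_threshold: u32,
    consecutive_failures: u32,
    consecutive_successes: u32, // only counted while unhealthy
    is_healthy: bool,
}

impl HealthTracker {
    fn new(failure_threshold: u32, recovery_threshold: u32) -> Self {
        Self {
            failure_threshold,
            recovery_threshold,
            consecutive_failures: 0,
            consecutive_successes: 0,
            is_healthy: true,
        }
    }

    /// Records a failed request. Returns true if this call flipped the
    /// state from healthy to unhealthy (i.e. routes should be updated).
    fn record_failure(&mut self) -> bool {
        self.consecutive_successes = 0;
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        if self.is_healthy && self.consecutive_failures >= self.failure_threshold {
            self.is_healthy = false;
            return true;
        }
        false
    }

    /// Records a successful request or probe. Returns true if this call
    /// flipped the state from unhealthy back to healthy.
    fn record_success(&mut self) -> bool {
        self.consecutive_failures = 0;
        if self.is_healthy {
            return false;
        }
        self.consecutive_successes += 1;
        if self.consecutive_successes >= self.recovery_threshold {
            self.is_healthy = true;
            self.consecutive_successes = 0;
            return true;
        }
        false
    }
}
```

Returning a "state flipped" boolean keeps the tracker independent of ProxyRouter: the caller decides when to rebuild routes and can log the switch at the same point.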

3. Health Check Strategies

Option A: Passive Monitoring

  • Track actual request failures/successes
  • No additional traffic, but slower to detect recovery
  • Could implement exponential backoff for retry attempts
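For the exponential backoff idea, a small sketch of how the delay between retry attempts against an unhealthy model might be computed. The function name `retry_delay` and its parameters are hypothetical; the doubling factor and the cap are assumptions, not values taken from the codebase.

```rust
use std::time::Duration;

/// Delay before the next retry: `base * 2^failed_attempts`, capped at `cap`.
/// Overflow in either the power or the multiplication falls back to `cap`.
fn retry_delay(base: Duration, failed_attempts: u32, cap: Duration) -> Duration {
    let factor = 2u32.checked_pow(failed_attempts).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(cap, |delay| delay.min(cap))
}
```

With a 30 second base and a 300 second cap, the delays would run 30s, 60s, 120s, 240s, then stay at 300s.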

Option B: Active Health Checks

  • Periodic lightweight test requests to each model
  • Faster recovery detection
  • Additional traffic/cost considerations

Option C: Hybrid Approach

  • Passive monitoring for failure detection
  • Active health checks only for models marked unhealthy
  • Balance between responsiveness and efficiency
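A sketch of the decision at the heart of Option C: healthy models are never probed, and unhealthy ones are probed at most once per interval. `should_probe` is a hypothetical helper, and `std::time::Instant` is used here instead of `DateTime<Utc>` purely to keep the example dependency-free.

```rust
use std::time::{Duration, Instant};

/// Decides whether an active health check should be sent now.
/// - Healthy models rely on passive monitoring only, so never probe.
/// - Unhealthy models are probed if they have never been probed, or if
///   at least `interval` has elapsed since the last probe.
fn should_probe(
    is_healthy: bool,
    last_probe: Option<Instant>,
    now: Instant,
    interval: Duration,
) -> bool {
    if is_healthy {
        return false;
    }
    match last_probe {
        None => true,
        Some(at) => now.duration_since(at) >= interval,
    }
}
```

Passing `now` in as a parameter rather than calling `Instant::now()` inside makes the rule easy to unit test.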

4. Integration Points

  • Modify ProxyRouter::refresh_cache() to consider health status
  • Add health tracking to request handlers in src/web/openai.rs
  • Store health state in memory with optional persistence
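One possible shape for the in-memory health state, shared between request handlers and the router. This is only a sketch: `HealthRegistry` is a hypothetical type, it uses the standard library's `RwLock` (an async-aware lock may fit the real handlers better), and it assumes an unknown (provider, model) pair should be treated as healthy.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical shared store keyed by (provider, model_id).
// Cloning the registry clones the Arc, so all clones see the same state.
#[derive(Clone, Default)]
struct HealthRegistry {
    inner: Arc<RwLock<HashMap<(String, String), bool>>>,
}

impl HealthRegistry {
    /// Marks a model on a provider as healthy or unhealthy.
    fn set_healthy(&self, provider: &str, model_id: &str, healthy: bool) {
        let mut map = self.inner.write().unwrap();
        map.insert((provider.to_string(), model_id.to_string()), healthy);
    }

    /// Pairs that have never been recorded default to healthy, so a
    /// fresh start keeps the default provider priority.
    fn is_healthy(&self, provider: &str, model_id: &str) -> bool {
        let map = self.inner.read().unwrap();
        map.get(&(provider.to_string(), model_id.to_string()))
            .copied()
            .unwrap_or(true)
    }
}
```

A read-write lock suits this access pattern because every request reads health state while writes only happen on failures and recoveries.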

5. Configuration

[model_health]
failure_threshold = 3  # Consecutive failures before marking unhealthy
recovery_check_interval = 300  # Seconds between health checks for failed models
recovery_threshold = 2  # Consecutive successes before marking healthy
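On the Rust side, these settings might map to something like the struct below. `ModelHealthConfig` and `validate` are hypothetical names; how the values are actually loaded (file parsing, environment variables) is left out, and the defaults simply mirror the example values above.

```rust
use std::time::Duration;

// Hypothetical in-code mirror of the [model_health] section.
#[derive(Debug, Clone, PartialEq)]
struct ModelHealthConfig {
    failure_threshold: u32,
    recovery_check_interval: Duration,
    recovery_threshold: u32,
}

impl Default for ModelHealthConfig {
    fn default() -> Self {
        Self {
            failure_threshold: 3,
            recovery_check_interval: Duration::from_secs(300),
            recovery_threshold: 2,
        }
    }
}

impl ModelHealthConfig {
    /// Rejects values that would make the health logic degenerate,
    /// e.g. a threshold of 0 would mark every model unhealthy at once.
    fn validate(&self) -> Result<(), String> {
        if self.failure_threshold == 0 {
            return Err("failure_threshold must be at least 1".to_string());
        }
        if self.recovery_threshold == 0 {
            return Err("recovery_threshold must be at least 1".to_string());
        }
        if self.recovery_check_interval.is_zero() {
            return Err("recovery_check_interval must be greater than 0".to_string());
        }
        Ok(())
    }
}
```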

Implementation Notes

  • Should distinguish connection errors (502, timeouts) from model-specific errors and handle each kind differently
  • Consider implementing circuit breaker pattern for more sophisticated failure handling
  • Log all provider switches for debugging/monitoring
  • Ensure thread-safe access to health state
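To make the connection vs. model-specific distinction concrete, here is one possible classification. The `Outcome` and `FailureKind` types are hypothetical, and the mapping is an assumption: only 502 and timeouts are named in this issue, so treating 503/504 as connection errors, other 5xx as model-specific, and everything else as not counting against health would need to be confirmed.

```rust
#[derive(Debug, PartialEq, Eq)]
enum FailureKind {
    Connection,    // provider unreachable or gateway-level failure
    ModelSpecific, // provider reachable, but this model errored
}

// Hypothetical summary of how an upstream request ended.
#[derive(Debug)]
enum Outcome {
    Status(u16),
    Timeout,
    ConnectFailed,
}

/// Returns the kind of failure to record, or None if the outcome
/// should not count against the model's health.
fn classify(outcome: &Outcome) -> Option<FailureKind> {
    match outcome {
        Outcome::Timeout | Outcome::ConnectFailed => Some(FailureKind::Connection),
        // Assumption: gateway-style statuses are treated like 502.
        Outcome::Status(502..=504) => Some(FailureKind::Connection),
        // Assumption: remaining 5xx responses are model-specific.
        Outcome::Status(500..=599) => Some(FailureKind::ModelSpecific),
        // 2xx-4xx: success or a client-side problem, not a health signal.
        Outcome::Status(_) => None,
    }
}
```

Separating the two kinds would let connection errors eventually feed a provider-wide signal while model-specific errors only affect the one model's route.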

Acceptance Criteria

  • Models automatically fail over to Continuum when Tinfoil has issues with them
  • Models that failed over automatically return to Tinfoil as primary once it is healthy again
  • No manual intervention required for provider switching
  • Health status visible in logs/metrics
  • Configurable thresholds and intervals
  • No performance degradation for healthy models

Related Code

  • src/proxy_config.rs: Current routing logic
  • src/web/openai.rs: Request handling and fallback logic
  • Current temporary fix at src/proxy_config.rs:342-366

Priority

High - This directly impacts user experience when Tinfoil has intermittent issues
