Phase 8.1: Implement Timeout Policies and Resilience Testing #109

@artcava

📋 Task Description

Implement comprehensive timeout policies using Polly to prevent indefinite waiting on slow operations. Create extensive resilience tests covering retry, circuit breaker, and timeout scenarios with failure simulations and performance measurements.

🎯 Objectives

  • Implement timeout policies for HTTP clients
  • Implement timeout policies for database operations
  • Implement timeout policies for message broker
  • Configure pessimistic and optimistic timeout strategies
  • Integrate timeout policies with retry and circuit breaker (triple wrap)
  • Add timeout telemetry and logging
  • Write comprehensive resilience integration tests
  • Simulate various failure scenarios
  • Measure resilience policy overhead
  • Test policy interaction (retry + circuit breaker + timeout)
  • Document resilience testing strategy
  • Create chaos testing scenarios

📦 Deliverables

1. Create Timeout Configuration

Create src/StarGate.Infrastructure/Resilience/TimeoutConfiguration.cs:

namespace StarGate.Infrastructure.Resilience;

/// <summary>
/// Configuration for timeout policies.
/// </summary>
public class TimeoutConfiguration
{
    /// <summary>
    /// Timeout for HTTP requests (seconds).
    /// </summary>
    public double HttpTimeoutSeconds { get; set; } = 30.0;

    /// <summary>
    /// Timeout for database operations (seconds).
    /// </summary>
    public double DatabaseTimeoutSeconds { get; set; } = 10.0;

    /// <summary>
    /// Timeout for message broker operations (seconds).
    /// </summary>
    public double BrokerTimeoutSeconds { get; set; } = 5.0;

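    /// <summary>
    /// Whether to use pessimistic timeout (stops waiting at the timeout even if the
    /// operation ignores cancellation; the operation may keep running in the background).
    /// If false, uses optimistic timeout (cancels co-operatively via CancellationToken).
    /// </summary>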
    public bool UsePessimisticTimeout { get; set; } = true;

    /// <summary>
    /// Gets HTTP timeout as TimeSpan.
    /// </summary>
    public TimeSpan HttpTimeout => TimeSpan.FromSeconds(HttpTimeoutSeconds);

    /// <summary>
    /// Gets database timeout as TimeSpan.
    /// </summary>
    public TimeSpan DatabaseTimeout => TimeSpan.FromSeconds(DatabaseTimeoutSeconds);

    /// <summary>
    /// Gets broker timeout as TimeSpan.
    /// </summary>
    public TimeSpan BrokerTimeout => TimeSpan.FromSeconds(BrokerTimeoutSeconds);
}

2. Create Timeout Policy Factory

Create src/StarGate.Infrastructure/Resilience/TimeoutPolicyFactory.cs:

namespace StarGate.Infrastructure.Resilience;

using Microsoft.Extensions.Logging;
using Polly;
using Polly.Timeout;

/// <summary>
/// Factory for creating Polly timeout policies.
/// </summary>
public static class TimeoutPolicyFactory
{
    /// <summary>
    /// Creates a timeout policy for HTTP operations.
    /// </summary>
    public static AsyncTimeoutPolicy<HttpResponseMessage> CreateHttpTimeoutPolicy(
        TimeoutConfiguration config,
        ILogger logger)
    {
        return Policy
            .TimeoutAsync<HttpResponseMessage>(
                timeout: config.HttpTimeout,
                timeoutStrategy: config.UsePessimisticTimeout
                    ? TimeoutStrategy.Pessimistic
                    : TimeoutStrategy.Optimistic,
                onTimeoutAsync: (context, timespan, task) =>
                {
                    logger.LogError(
                        "HTTP operation timed out: Timeout={Timeout}s, Strategy={Strategy}",
                        timespan.TotalSeconds,
                        config.UsePessimisticTimeout ? "Pessimistic" : "Optimistic");

                    return Task.CompletedTask;
                });
    }

    /// <summary>
    /// Creates a timeout policy for database operations.
    /// </summary>
    public static AsyncTimeoutPolicy CreateDatabaseTimeoutPolicy(
        TimeoutConfiguration config,
        ILogger logger)
    {
        return Policy
            .TimeoutAsync(
                timeout: config.DatabaseTimeout,
                timeoutStrategy: config.UsePessimisticTimeout
                    ? TimeoutStrategy.Pessimistic
                    : TimeoutStrategy.Optimistic,
                onTimeoutAsync: (context, timespan, task) =>
                {
                    logger.LogError(
                        "Database operation timed out: Timeout={Timeout}s, Strategy={Strategy}",
                        timespan.TotalSeconds,
                        config.UsePessimisticTimeout ? "Pessimistic" : "Optimistic");

                    return Task.CompletedTask;
                });
    }

    /// <summary>
    /// Creates a timeout policy for message broker operations.
    /// </summary>
    public static AsyncTimeoutPolicy CreateBrokerTimeoutPolicy(
        TimeoutConfiguration config,
        ILogger logger)
    {
        return Policy
            .TimeoutAsync(
                timeout: config.BrokerTimeout,
                timeoutStrategy: config.UsePessimisticTimeout
                    ? TimeoutStrategy.Pessimistic
                    : TimeoutStrategy.Optimistic,
                onTimeoutAsync: (context, timespan, task) =>
                {
                    logger.LogError(
                        "Broker operation timed out: Timeout={Timeout}s, Strategy={Strategy}",
                        timespan.TotalSeconds,
                        config.UsePessimisticTimeout ? "Pessimistic" : "Optimistic");

                    return Task.CompletedTask;
                });
    }
}

3. Update Resilience Policy Wrapper

Update src/StarGate.Infrastructure/Resilience/ResiliencePolicyWrapper.cs:

/// <summary>
/// Creates a complete resilience policy with timeout, circuit breaker, and retry.
/// </summary>
public static AsyncPolicyWrap<HttpResponseMessage> CreateCompleteHttpResiliencePolicy(
    TimeoutConfiguration timeoutConfig,
    RetryPolicyConfiguration retryConfig,
    CircuitBreakerConfiguration circuitConfig,
    ILogger logger)
{
    var timeoutPolicy = TimeoutPolicyFactory.CreateHttpTimeoutPolicy(timeoutConfig, logger);
    var retryPolicy = RetryPolicyFactory.CreateHttpRetryPolicy(retryConfig, logger);
    var circuitBreaker = CircuitBreakerFactory.CreateHttpCircuitBreaker(circuitConfig, logger);

    // Wrap: Timeout (outer) -> Circuit Breaker -> Retry (inner)
    return Policy.WrapAsync(timeoutPolicy, circuitBreaker, retryPolicy);
}

/// <summary>
/// Creates a complete resilience policy for database operations.
/// </summary>
public static AsyncPolicyWrap CreateCompleteDatabaseResiliencePolicy(
    TimeoutConfiguration timeoutConfig,
    RetryPolicyConfiguration retryConfig,
    CircuitBreakerConfiguration circuitConfig,
    ILogger logger)
{
    var timeoutPolicy = TimeoutPolicyFactory.CreateDatabaseTimeoutPolicy(timeoutConfig, logger);
    var retryPolicy = RetryPolicyFactory.CreateDatabaseRetryPolicy(retryConfig, logger);
    var circuitBreaker = CircuitBreakerFactory.CreateDatabaseCircuitBreaker(circuitConfig, logger);

    return Policy.WrapAsync(timeoutPolicy, circuitBreaker, retryPolicy);
}

/// <summary>
/// Creates a complete resilience policy for broker operations.
/// </summary>
public static AsyncPolicyWrap CreateCompleteBrokerResiliencePolicy(
    TimeoutConfiguration timeoutConfig,
    RetryPolicyConfiguration retryConfig,
    CircuitBreakerConfiguration circuitConfig,
    ILogger logger)
{
    var timeoutPolicy = TimeoutPolicyFactory.CreateBrokerTimeoutPolicy(timeoutConfig, logger);
    var retryPolicy = RetryPolicyFactory.CreateBrokerRetryPolicy(retryConfig, logger);
    var circuitBreaker = CircuitBreakerFactory.CreateBrokerCircuitBreaker(circuitConfig, logger);

    return Policy.WrapAsync(timeoutPolicy, circuitBreaker, retryPolicy);
}

4. Update Configuration

Update src/StarGate.Server/appsettings.json:

{
  "Resilience": {
    "Timeout": {
      "HttpTimeoutSeconds": 30.0,
      "DatabaseTimeoutSeconds": 10.0,
      "BrokerTimeoutSeconds": 5.0,
      "UsePessimisticTimeout": true
    },
    "Retry": {
      "MaxRetryAttempts": 3,
      "InitialDelaySeconds": 1.0,
      "MaxDelaySeconds": 30.0,
      "BackoffMultiplier": 2.0,
      "UseJitter": true
    },
    "CircuitBreaker": {
      "FailureThreshold": 5,
      "FailureRateThreshold": 0.5,
      "MinimumThroughput": 10,
      "BreakDurationSeconds": 30.0,
      "SamplingDurationSeconds": 60.0
    }
  }
}

5. Update Resilience Extensions

Update src/StarGate.Infrastructure/Extensions/ResilienceServiceCollectionExtensions.cs:

public static IServiceCollection AddResiliencePolicies(
    this IServiceCollection services,
    IConfiguration configuration)
{
    // Register configurations
    services.Configure<TimeoutConfiguration>(
        configuration.GetSection("Resilience:Timeout"));
    services.Configure<RetryPolicyConfiguration>(
        configuration.GetSection("Resilience:Retry"));
    services.Configure<CircuitBreakerConfiguration>(
        configuration.GetSection("Resilience:CircuitBreaker"));

    // Register the wrapped policies under distinct keys. Registering AsyncPolicyWrap
    // twice as a singleton would let the second registration win on resolution, so
    // consumers could not tell the database policy from the broker policy.
    // PolicyRegistry lives in Polly.Registry; the key names below are illustrative.
    services.AddSingleton<IReadOnlyPolicyRegistry<string>>(provider =>
    {
        var timeoutConfig = provider.GetRequiredService<IOptions<TimeoutConfiguration>>().Value;
        var retryConfig = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
        var circuitConfig = provider.GetRequiredService<IOptions<CircuitBreakerConfiguration>>().Value;
        // Assumes ResiliencePolicyWrapper is a non-static class; a static class cannot be a type argument.
        var logger = provider.GetRequiredService<ILogger<ResiliencePolicyWrapper>>();

        var registry = new PolicyRegistry();
        registry.Add(
            "DatabaseResilience",
            ResiliencePolicyWrapper.CreateCompleteDatabaseResiliencePolicy(
                timeoutConfig, retryConfig, circuitConfig, logger));
        registry.Add(
            "BrokerResilience",
            ResiliencePolicyWrapper.CreateCompleteBrokerResiliencePolicy(
                timeoutConfig, retryConfig, circuitConfig, logger));
        return registry;
    });

    return services;
}

6. Create Resilience Integration Tests

Create tests/StarGate.IntegrationTests/Resilience/ResilienceIntegrationTests.cs:

namespace StarGate.IntegrationTests.Resilience;

using FluentAssertions;
using Microsoft.AspNetCore.Mvc.Testing;
using Microsoft.Extensions.DependencyInjection;
using Polly.CircuitBreaker;
using StarGate.Infrastructure.Resilience;
using Xunit;

// WebApplicationFactory is generic; Program is assumed to be the server's
// entry-point type and must be visible to the test project.
public class ResilienceIntegrationTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly WebApplicationFactory<Program> _factory;

    public ResilienceIntegrationTests(WebApplicationFactory<Program> factory)
    {
        _factory = factory;
    }

    [Fact]
    public async Task Should_RetryOnTransientFailures()
    {
        // Test retry policy with simulated transient failures
        // Implement test logic
    }

    [Fact]
    public async Task Should_OpenCircuitAfterThreshold()
    {
        // Test circuit breaker opening after threshold
        // Implement test logic
    }

    [Fact]
    public async Task Should_TimeoutSlowOperations()
    {
        // Test timeout policy on slow operations
        // Implement test logic
    }

    [Fact]
    public async Task Should_CombineAllPoliciesCorrectly()
    {
        // Test interaction of timeout, circuit breaker, and retry
        // Implement test logic
    }
}
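As a starting point for the stubs above, the timeout test could be filled in roughly as follows. This is a minimal sketch, not the required implementation: it assumes the `TimeoutConfiguration` and `TimeoutPolicyFactory` types from deliverables 1 and 2, exercises the policy directly rather than through the web host, and uses a shortened timeout and a `Task.Delay` stand-in for a slow broker call.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging.Abstractions;
using Polly.Timeout;
using StarGate.Infrastructure.Resilience;
using Xunit;

public class TimeoutPolicySketchTests
{
    [Fact]
    public async Task Should_TimeoutSlowOperations()
    {
        // Shorten the timeout so the test finishes quickly.
        var config = new TimeoutConfiguration { BrokerTimeoutSeconds = 0.1 };
        var policy = TimeoutPolicyFactory.CreateBrokerTimeoutPolicy(config, NullLogger.Instance);

        // The simulated operation takes far longer than the 100 ms timeout.
        await Assert.ThrowsAsync<TimeoutRejectedException>(() =>
            policy.ExecuteAsync(
                ct => Task.Delay(TimeSpan.FromSeconds(2), ct),
                CancellationToken.None));
    }
}
```

Testing the policy in isolation keeps the assertion independent of MongoDB or broker availability; the infrastructure-backed variants can then reuse the same assertion.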

Create tests/StarGate.IntegrationTests/Resilience/ChaosTests.cs:

namespace StarGate.IntegrationTests.Resilience;

using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;

/// <summary>
/// Chaos testing scenarios for resilience validation.
/// </summary>
// WebApplicationFactory is generic; Program is assumed to be the server's entry-point type.
public class ChaosTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly WebApplicationFactory<Program> _factory;

    public ChaosTests(WebApplicationFactory<Program> factory)
    {
        _factory = factory;
    }

    [Fact]
    public async Task ChaosScenario_DatabaseIntermittentFailures()
    {
        // Simulate random database failures (30% failure rate)
        // Verify retry handles intermittent failures
        // Measure success rate and latency
    }

    [Fact]
    public async Task ChaosScenario_DatabaseProlongedOutage()
    {
        // Simulate complete database unavailability
        // Verify circuit breaker opens
        // Verify fail-fast behavior
        // Verify recovery when database restored
    }

    [Fact]
    public async Task ChaosScenario_BrokerSlowResponses()
    {
        // Simulate slow broker responses (>timeout)
        // Verify timeout policy activates
        // Measure performance impact
    }

    [Fact]
    public async Task ChaosScenario_NetworkPartition()
    {
        // Simulate network issues (timeouts, connection errors)
        // Verify all policies work together
        // Measure system degradation
    }

    [Fact]
    public async Task ChaosScenario_HighLoad()
    {
        // Simulate high load with varying failure rates
        // Verify circuit breaker protects system
        // Measure throughput with/without policies
    }
}

7. Create Performance Tests

Create tests/StarGate.PerformanceTests/ResiliencePolicyOverheadTests.cs:

namespace StarGate.PerformanceTests;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
[SimpleJob(warmupCount: 3, iterationCount: 10)]
public class ResiliencePolicyOverheadTests
{
    [Benchmark(Baseline = true)]
    public async Task Operation_WithoutPolicies()
    {
        // Measure baseline performance
        await Task.Delay(10);
    }

    [Benchmark]
    public async Task Operation_WithRetryPolicy()
    {
        // Measure overhead with retry policy
    }

    [Benchmark]
    public async Task Operation_WithCircuitBreaker()
    {
        // Measure overhead with circuit breaker
    }

    [Benchmark]
    public async Task Operation_WithTimeout()
    {
        // Measure overhead with timeout
    }

    [Benchmark]
    public async Task Operation_WithAllPolicies()
    {
        // Measure overhead with complete policy stack
    }
}
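One way the timeout benchmark might be filled in is sketched below. The class name `TimeoutOverheadSketch` and the `BenchmarkProgram` entry point are illustrative, not part of the deliverable. The sketch deliberately uses an already-completed task as the operation: a `Task.Delay(10)` baseline is subject to timer resolution that can be much larger than the sub-millisecond overhead being measured.

```csharp
using System;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Polly;
using Polly.Timeout;

[MemoryDiagnoser]
public class TimeoutOverheadSketch
{
    private AsyncTimeoutPolicy _timeout = null!;

    // Build the policy once so construction cost is not part of the measurement.
    [GlobalSetup]
    public void Setup() =>
        _timeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(30), TimeoutStrategy.Optimistic);

    [Benchmark(Baseline = true)]
    public Task Operation_WithoutPolicies() => Task.CompletedTask;

    [Benchmark]
    public Task Operation_WithTimeout() =>
        _timeout.ExecuteAsync(() => Task.CompletedTask);
}

public static class BenchmarkProgram
{
    // Entry point used by `dotnet run -c Release`.
    public static void Main() => BenchmarkRunner.Run<TimeoutOverheadSketch>();
}
```

The retry, circuit breaker, and full-stack benchmarks can follow the same shape by swapping in the corresponding policy in `Setup`.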

8. Create Resilience Documentation

Create docs/RESILIENCE-STRATEGY.md:

# Resilience Strategy

## Overview

StarGate implements comprehensive resilience patterns using Polly to handle failures gracefully.

## Policies Implemented

### 1. Retry Policy
- **Purpose:** Handle transient failures
- **Strategy:** Exponential backoff with jitter
- **Max Attempts:** 3
- **Delays:** 1s, 2s, 4s (+/- 10% jitter)

### 2. Circuit Breaker
- **Purpose:** Prevent cascading failures
- **Threshold:** 50% failure rate
- **Minimum Throughput:** 10 requests
- **Break Duration:** 30 seconds

### 3. Timeout
- **Purpose:** Prevent indefinite waiting
- **HTTP:** 30 seconds
- **Database:** 10 seconds
- **Broker:** 5 seconds

## Policy Combination

Timeout (outer)
  ↓
Circuit Breaker
  ↓
Retry (inner)
  ↓
Operation


## Configuration

See `appsettings.json` for configuration options.

## Monitoring

Check health endpoint: `/health`

## Testing

See `tests/StarGate.IntegrationTests/Resilience/` for test scenarios.

✅ Acceptance Criteria

  • TimeoutConfiguration implemented
  • TimeoutPolicyFactory created for HTTP, database, and broker
  • Pessimistic and optimistic timeout strategies supported
  • Complete resilience policy wrapper (timeout + circuit breaker + retry)
  • All configurations registered in DI container
  • Configuration files updated with timeout settings
  • Comprehensive logging for timeout events
  • Resilience integration tests implemented
  • Chaos testing scenarios created
  • Performance overhead tests created
  • Resilience strategy documented
  • Health checks reflect all policy states
  • Code follows CODING-CONVENTIONS.md

📝 Testing Instructions

# Run unit tests
dotnet test tests/StarGate.Infrastructure.Tests --filter "FullyQualifiedName~Timeout"

# Run integration tests
dotnet test tests/StarGate.IntegrationTests --filter "FullyQualifiedName~Resilience"

# Run chaos tests
dotnet test tests/StarGate.IntegrationTests --filter "FullyQualifiedName~Chaos"

# Run performance tests
cd tests/StarGate.PerformanceTests
dotnet run -c Release

# Test timeout with slow database
# 1. Add artificial delay in MongoDB query
# 2. Verify timeout triggers
# 3. Check logs: "Database operation timed out: Timeout=10s"

# Test complete policy stack
# 1. Stop MongoDB
# 2. Create process (should retry)
# 3. After 10+ failures, circuit opens
# 4. Requests fail immediately with BrokenCircuitException (no timeout wait)
# 5. Wait 30s for half-open
# 6. Restart MongoDB
# 7. Circuit closes on success

# Measure policy overhead
# Compare performance (expected values, to be confirmed by the benchmarks):
# - Without policies: baseline
# - With retry only: +0.5ms
# - With circuit breaker: +0.3ms
# - With timeout: +0.2ms
# - With all: +1.0ms

# Verify health endpoint
curl http://localhost:5000/health | jq
# Should show:
# - All policies configured
# - Circuit states
# - Recent failures/successes

📚 References

🏷️ Labels

phase-8 resilience sprint-8.1 polly timeout testing

⏱️ Estimated Effort

10-12 hours

🔗 Dependencies

🔗 Related Issues

Part of Phase 8: Resilience - Sprint 8.1: Polly Integration

📌 Important Notes

Policy Wrapping Order

Timeout (outer) → Must be outermost
  ↓               Total operation timeout
Circuit Breaker → Protects from cascading failures
  ↓               Fails fast when open
Retry (inner) →   Handles transient failures
  ↓               Multiple attempts
Operation →       Actual work

Why this order?

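  1. Timeout outermost: Bounds the total operation time, including all retries and backoff delays
  2. Circuit breaker middle: When open, rejects the call before any retry is attempted
  3. Retry innermost: Retries run inside the breaker, so the breaker records one outcome per call (after retries are exhausted), not one per attempt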
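The effect of this ordering can be checked with a small experiment. The sketch below is an assumption-laden illustration, not the project's wrapper: it builds its own policies with small thresholds (2 retries, breaker opens after 2 consecutive failures) and calls an operation that always fails.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public static class WrapOrderSketch
{
    public static async Task<(int Invocations, bool CircuitOpened)> RunAsync()
    {
        var invocations = 0;

        var timeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(5));
        var breaker = Policy.Handle<InvalidOperationException>()
            .CircuitBreakerAsync(2, TimeSpan.FromSeconds(30));
        var retry = Policy.Handle<InvalidOperationException>().RetryAsync(2);

        // Outermost first, matching the order used in this issue.
        var wrap = Policy.WrapAsync(timeout, breaker, retry);

        var circuitOpened = false;
        for (var call = 0; call < 3; call++)
        {
            try
            {
                await wrap.ExecuteAsync(() =>
                {
                    invocations++;
                    return Task.FromException(new InvalidOperationException("simulated outage"));
                });
            }
            catch (BrokenCircuitException)
            {
                circuitOpened = true; // rejected before reaching the retry policy
            }
            catch (InvalidOperationException)
            {
                // Retries exhausted; the breaker counts this as a single failure.
            }
        }

        return (invocations, circuitOpened);
    }
}
```

With these settings the first two calls should each invoke the operation three times (one attempt plus two retries) and count as one breaker failure each. The third call should then be rejected with `BrokenCircuitException` without invoking the operation, for six invocations in total.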

Pessimistic vs Optimistic Timeout

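Pessimistic (default in TimeoutConfiguration):

TimeoutStrategy.Pessimistic
  • Does not rely on the operation honoring a CancellationToken
  • Caller stops waiting at the timeout and receives TimeoutRejectedException
  • Abandoned operation may continue running in the background
  • Use when the operation cannot be cancelled co-operatively

Optimistic:

TimeoutStrategy.Optimistic
  • Cancels the operation co-operatively via the CancellationToken Polly passes to it
  • Requires the operation to observe that token
  • Operation actually stops, so no work is left running
  • Preferred whenever the operation supports cancellation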
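The difference between the two strategies shows up when the executed delegate ignores its token. The following sketch uses a hypothetical helper (`TimesOutAsync`, not part of StarGate) that runs a one-second delay under a 100 ms timeout and reports whether Polly rejected the call.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class TimeoutStrategySketch
{
    // Returns true if the policy rejected the call with TimeoutRejectedException.
    public static async Task<bool> TimesOutAsync(TimeoutStrategy strategy, bool honorToken)
    {
        var policy = Policy.TimeoutAsync(TimeSpan.FromMilliseconds(100), strategy);
        try
        {
            await policy.ExecuteAsync(
                ct => honorToken
                    ? Task.Delay(TimeSpan.FromSeconds(1), ct) // co-operative: ends when ct is cancelled
                    : Task.Delay(TimeSpan.FromSeconds(1)),    // unco-operative: ignores ct
                CancellationToken.None);
            return false; // the delegate ran to completion
        }
        catch (TimeoutRejectedException)
        {
            return true;
        }
    }
}
```

Under these assumptions, `Optimistic` with a token-aware delegate and `Pessimistic` with a token-ignoring delegate should both time out, while `Optimistic` with a token-ignoring delegate should simply run for the full second, because nothing forces the delegate to stop.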

Timeout Values Rationale

HTTP: 30 seconds

  • External API calls
  • May include network latency
  • Allows for slow responses

Database: 10 seconds

  • Local network (low latency)
  • Query should be fast
  • Longer suggests query issue

Broker: 5 seconds

  • Local network
  • Should be very fast
  • Longer suggests connection issue

Testing Strategy

Unit Tests:

  • Test policy configuration
  • Test timeout calculation
  • Mock slow operations

Integration Tests:

  • Test with real infrastructure
  • Simulate slow responses
  • Verify timeout triggers

Chaos Tests:

  • Random failures
  • Prolonged outages
  • Network issues
  • High load

Performance Tests:

  • Measure overhead
  • Compare scenarios
  • Identify bottlenecks

Performance Overhead

Expected Overhead (Success Case):

  • Retry: ~0.5ms (tracking state)
  • Circuit Breaker: ~0.3ms (state check)
  • Timeout: ~0.2ms (timer setup)
  • Total: ~1ms (acceptable)

Failure Case:

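  • Retry: Up to +7s of backoff (1s + 2s + 4s delays), capped by the outer timeout
  • Circuit Breaker: Fail immediately when open
  • Timeout: Fail at timeout threshold; with the 5s broker timeout, the full 7s backoff sequence cannot complete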
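The backoff arithmetic can be reproduced with a few lines of code. This is a sketch of a generic exponential backoff calculation using the values from `appsettings.json`; `BackoffSketch` is a hypothetical helper, and the actual `RetryPolicyFactory` may compute its delays differently.

```csharp
using System;
using System.Collections.Generic;

public static class BackoffSketch
{
    // Base delay for attempt n is initial * multiplier^n, capped at maxDelaySeconds.
    public static IReadOnlyList<TimeSpan> Delays(
        int maxRetryAttempts,
        double initialDelaySeconds,
        double multiplier,
        double maxDelaySeconds,
        double jitterFraction = 0.0,
        Random? random = null)
    {
        var delays = new List<TimeSpan>(maxRetryAttempts);
        for (var attempt = 0; attempt < maxRetryAttempts; attempt++)
        {
            var seconds = Math.Min(
                initialDelaySeconds * Math.Pow(multiplier, attempt),
                maxDelaySeconds);

            if (jitterFraction > 0 && random is not null)
            {
                // Uniform jitter in [-jitterFraction, +jitterFraction] of the base delay.
                seconds *= 1 + ((random.NextDouble() * 2 - 1) * jitterFraction);
            }

            delays.Add(TimeSpan.FromSeconds(seconds));
        }

        return delays;
    }
}
```

Calling `BackoffSketch.Delays(3, 1.0, 2.0, 30.0)` yields 1s, 2s, and 4s, or 7s in total before jitter. Comparing that sum against each outer timeout (30s, 10s, 5s) shows which targets can actually use all three retries.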

Monitoring Recommendations

Key Metrics:

  • Policy execution count
  • Success/failure rate
  • Retry attempts
  • Circuit state changes
  • Timeout occurrences
  • Average latency
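Several of these metrics could be emitted with the built-in `System.Diagnostics.Metrics` API. The sketch below is one possible shape, assuming .NET 6 or later; the meter name, instrument names, and the `target` tag are illustrative choices, not an existing StarGate convention.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class ResilienceMetricsSketch
{
    // Hypothetical meter name; exporters subscribe to meters by name.
    private static readonly Meter ResilienceMeter = new("StarGate.Resilience");

    private static readonly Counter<long> Timeouts =
        ResilienceMeter.CreateCounter<long>("resilience.timeouts");

    private static readonly Counter<long> RetryAttempts =
        ResilienceMeter.CreateCounter<long>("resilience.retry_attempts");

    // Intended to be called from a timeout callback such as onTimeoutAsync.
    public static void RecordTimeout(string target) =>
        Timeouts.Add(1, new KeyValuePair<string, object?>("target", target));

    // Intended to be called from a retry callback.
    public static void RecordRetry(string target) =>
        RetryAttempts.Add(1, new KeyValuePair<string, object?>("target", target));
}
```

For example, the `onTimeoutAsync` callback in `TimeoutPolicyFactory` could call `ResilienceMetricsSketch.RecordTimeout("database")` next to the existing log statement.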

Dashboards:

  • Policy health overview
  • Circuit breaker states
  • Timeout trends
  • Performance impact

Alerts:

  • High timeout rate
  • Circuit opened
  • High retry rate
  • Degraded performance

Production Considerations

Configuration Tuning:

  • Start conservative
  • Monitor metrics
  • Adjust based on behavior
  • Different values per environment

Gradual Rollout:

  1. Enable retry only
  2. Add circuit breaker
  3. Add timeout
  4. Monitor each step

Fallback Strategies:

  • Cached responses
  • Default values
  • Degraded mode
  • Clear error messages

Chaos Engineering

Regularly test:

  • Random infrastructure failures
  • Network latency injection
  • Slow response simulation
  • High load scenarios
  • Resource exhaustion

Tools:

  • Toxiproxy (network chaos)
  • Simmy (chaos policies)
  • Custom failure injection
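For the custom failure injection option, a minimal sketch might look like the following. `FaultInjector` is a hypothetical helper, not an existing StarGate or Simmy type; it fails a configurable fraction of calls and takes a seed so that a chaos test run can be reproduced.

```csharp
using System;
using System.Threading.Tasks;

public sealed class FaultInjector
{
    private readonly object _gate = new();
    private readonly Random _random;
    private readonly double _failureRate;

    public FaultInjector(double failureRate, int seed)
    {
        if (failureRate < 0 || failureRate > 1)
        {
            throw new ArgumentOutOfRangeException(nameof(failureRate));
        }

        _failureRate = failureRate;
        _random = new Random(seed);
    }

    // Runs the operation, or throws an injected fault with probability failureRate.
    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        bool fail;
        lock (_gate) // Random is not thread-safe
        {
            fail = _random.NextDouble() < _failureRate;
        }

        if (fail)
        {
            throw new TimeoutException("Injected fault");
        }

        return await operation();
    }
}
```

A scenario such as `ChaosScenario_DatabaseIntermittentFailures` could wrap its database calls in `new FaultInjector(0.3, seed: 42)` and then measure how often the retry policy recovers.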

Documentation Requirements

For Developers:

  • Policy configuration guide
  • Testing guidelines
  • Troubleshooting tips
  • Best practices

For Operations:

  • Monitoring setup
  • Alert configuration
  • Incident response
  • Performance tuning
