📋 Task Description
Implement comprehensive timeout policies using Polly to prevent indefinite waiting on slow operations. Create extensive resilience tests covering retry, circuit breaker, and timeout scenarios with failure simulations and performance measurements.
🎯 Objectives
- Implement timeout policies for HTTP clients
- Implement timeout policies for database operations
- Implement timeout policies for message broker
- Configure pessimistic and optimistic timeout strategies
- Integrate timeout policies with retry and circuit breaker (triple wrap)
- Add timeout telemetry and logging
- Write comprehensive resilience integration tests
- Simulate various failure scenarios
- Measure resilience policy overhead
- Test policy interaction (retry + circuit breaker + timeout)
- Document resilience testing strategy
- Create chaos testing scenarios
📦 Deliverables
1. Create Timeout Configuration
Create src/StarGate.Infrastructure/Resilience/TimeoutConfiguration.cs:
namespace StarGate.Infrastructure.Resilience;
/// <summary>
/// Configuration for timeout policies.
/// </summary>
public class TimeoutConfiguration
{
/// <summary>
/// Timeout for HTTP requests (seconds).
/// </summary>
public double HttpTimeoutSeconds { get; set; } = 30.0;
/// <summary>
/// Timeout for database operations (seconds).
/// </summary>
public double DatabaseTimeoutSeconds { get; set; } = 10.0;
/// <summary>
/// Timeout for message broker operations (seconds).
/// </summary>
public double BrokerTimeoutSeconds { get; set; } = 5.0;
/// <summary>
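/// Whether to use the pessimistic timeout strategy, which enforces the timeout even when
/// the operation ignores cancellation (the abandoned operation may keep running).
/// If false, uses the optimistic strategy, which cancels cooperatively through the
/// CancellationToken that Polly passes to the operation.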
/// </summary>
public bool UsePessimisticTimeout { get; set; } = true;
/// <summary>
/// Gets HTTP timeout as TimeSpan.
/// </summary>
public TimeSpan HttpTimeout => TimeSpan.FromSeconds(HttpTimeoutSeconds);
/// <summary>
/// Gets database timeout as TimeSpan.
/// </summary>
public TimeSpan DatabaseTimeout => TimeSpan.FromSeconds(DatabaseTimeoutSeconds);
/// <summary>
/// Gets broker timeout as TimeSpan.
/// </summary>
public TimeSpan BrokerTimeout => TimeSpan.FromSeconds(BrokerTimeoutSeconds);
}
2. Create Timeout Policy Factory
Create src/StarGate.Infrastructure/Resilience/TimeoutPolicyFactory.cs:
namespace StarGate.Infrastructure.Resilience;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.Timeout;
/// <summary>
/// Factory for creating Polly timeout policies.
/// </summary>
public static class TimeoutPolicyFactory
{
/// <summary>
/// Creates a timeout policy for HTTP operations.
/// </summary>
public static AsyncTimeoutPolicy<HttpResponseMessage> CreateHttpTimeoutPolicy(
TimeoutConfiguration config,
ILogger logger)
{
return Policy
.TimeoutAsync<HttpResponseMessage>(
timeout: config.HttpTimeout,
timeoutStrategy: config.UsePessimisticTimeout
? TimeoutStrategy.Pessimistic
: TimeoutStrategy.Optimistic,
onTimeoutAsync: (context, timespan, task) =>
{
logger.LogError(
"HTTP operation timed out: Timeout={Timeout}s, Strategy={Strategy}",
timespan.TotalSeconds,
config.UsePessimisticTimeout ? "Pessimistic" : "Optimistic");
return Task.CompletedTask;
});
}
/// <summary>
/// Creates a timeout policy for database operations.
/// </summary>
public static AsyncTimeoutPolicy CreateDatabaseTimeoutPolicy(
TimeoutConfiguration config,
ILogger logger)
{
return Policy
.TimeoutAsync(
timeout: config.DatabaseTimeout,
timeoutStrategy: config.UsePessimisticTimeout
? TimeoutStrategy.Pessimistic
: TimeoutStrategy.Optimistic,
onTimeoutAsync: (context, timespan, task) =>
{
logger.LogError(
"Database operation timed out: Timeout={Timeout}s, Strategy={Strategy}",
timespan.TotalSeconds,
config.UsePessimisticTimeout ? "Pessimistic" : "Optimistic");
return Task.CompletedTask;
});
}
/// <summary>
/// Creates a timeout policy for message broker operations.
/// </summary>
public static AsyncTimeoutPolicy CreateBrokerTimeoutPolicy(
TimeoutConfiguration config,
ILogger logger)
{
return Policy
.TimeoutAsync(
timeout: config.BrokerTimeout,
timeoutStrategy: config.UsePessimisticTimeout
? TimeoutStrategy.Pessimistic
: TimeoutStrategy.Optimistic,
onTimeoutAsync: (context, timespan, task) =>
{
logger.LogError(
"Broker operation timed out: Timeout={Timeout}s, Strategy={Strategy}",
timespan.TotalSeconds,
config.UsePessimisticTimeout ? "Pessimistic" : "Optimistic");
return Task.CompletedTask;
});
}
}
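A caller-side sketch of how one of these policies might be consumed. The helper name `TryExecuteAsync` is hypothetical and not part of the StarGate codebase. It assumes Polly v7, where a timed-out execution throws `TimeoutRejectedException`. Passing Polly's token into the operation matters for the optimistic strategy, which only works when the operation honors that token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class TimeoutPolicyUsage
{
    // Hypothetical helper: runs an operation under any async policy and
    // reports whether it finished before a timeout policy rejected it.
    public static async Task<bool> TryExecuteAsync(
        IAsyncPolicy policy,
        Func<CancellationToken, Task> operation,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // Polly hands the delegate a token that it cancels on timeout.
            await policy.ExecuteAsync(operation, cancellationToken);
            return true;
        }
        catch (TimeoutRejectedException)
        {
            // Thrown by Polly's timeout policy once the limit elapses.
            return false;
        }
    }
}
```

Any of the three factory results can be passed as `policy`, for example `await TimeoutPolicyUsage.TryExecuteAsync(policy, ct => Task.Delay(TimeSpan.FromSeconds(1), ct))`.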
3. Update Resilience Policy Wrapper
Update src/StarGate.Infrastructure/Resilience/ResiliencePolicyWrapper.cs:
/// <summary>
/// Creates a complete resilience policy with timeout, circuit breaker, and retry.
/// </summary>
public static AsyncPolicyWrap<HttpResponseMessage> CreateCompleteHttpResiliencePolicy(
TimeoutConfiguration timeoutConfig,
RetryPolicyConfiguration retryConfig,
CircuitBreakerConfiguration circuitConfig,
ILogger logger)
{
var timeoutPolicy = TimeoutPolicyFactory.CreateHttpTimeoutPolicy(timeoutConfig, logger);
var retryPolicy = RetryPolicyFactory.CreateHttpRetryPolicy(retryConfig, logger);
var circuitBreaker = CircuitBreakerFactory.CreateHttpCircuitBreaker(circuitConfig, logger);
// Wrap: Timeout (outer) -> Circuit Breaker -> Retry (inner)
return Policy.WrapAsync(timeoutPolicy, circuitBreaker, retryPolicy);
}
/// <summary>
/// Creates a complete resilience policy for database operations.
/// </summary>
public static AsyncPolicyWrap CreateCompleteDatabaseResiliencePolicy(
TimeoutConfiguration timeoutConfig,
RetryPolicyConfiguration retryConfig,
CircuitBreakerConfiguration circuitConfig,
ILogger logger)
{
var timeoutPolicy = TimeoutPolicyFactory.CreateDatabaseTimeoutPolicy(timeoutConfig, logger);
var retryPolicy = RetryPolicyFactory.CreateDatabaseRetryPolicy(retryConfig, logger);
var circuitBreaker = CircuitBreakerFactory.CreateDatabaseCircuitBreaker(circuitConfig, logger);
return Policy.WrapAsync(timeoutPolicy, circuitBreaker, retryPolicy);
}
/// <summary>
/// Creates a complete resilience policy for broker operations.
/// </summary>
public static AsyncPolicyWrap CreateCompleteBrokerResiliencePolicy(
TimeoutConfiguration timeoutConfig,
RetryPolicyConfiguration retryConfig,
CircuitBreakerConfiguration circuitConfig,
ILogger logger)
{
var timeoutPolicy = TimeoutPolicyFactory.CreateBrokerTimeoutPolicy(timeoutConfig, logger);
var retryPolicy = RetryPolicyFactory.CreateBrokerRetryPolicy(retryConfig, logger);
var circuitBreaker = CircuitBreakerFactory.CreateBrokerCircuitBreaker(circuitConfig, logger);
return Policy.WrapAsync(timeoutPolicy, circuitBreaker, retryPolicy);
}
4. Update Configuration
Update src/StarGate.Server/appsettings.json:
{
"Resilience": {
"Timeout": {
"HttpTimeoutSeconds": 30.0,
"DatabaseTimeoutSeconds": 10.0,
"BrokerTimeoutSeconds": 5.0,
"UsePessimisticTimeout": true
},
"Retry": {
"MaxRetryAttempts": 3,
"InitialDelaySeconds": 1.0,
"MaxDelaySeconds": 30.0,
"BackoffMultiplier": 2.0,
"UseJitter": true
},
"CircuitBreaker": {
"FailureThreshold": 5,
"FailureRateThreshold": 0.5,
"MinimumThroughput": 10,
"BreakDurationSeconds": 30.0,
"SamplingDurationSeconds": 60.0
}
}
}
5. Update Resilience Extensions
Update src/StarGate.Infrastructure/Extensions/ResilienceServiceCollectionExtensions.cs:
public static IServiceCollection AddResiliencePolicies(
this IServiceCollection services,
IConfiguration configuration)
{
// Register configurations
services.Configure<TimeoutConfiguration>(
configuration.GetSection("Resilience:Timeout"));
services.Configure<RetryPolicyConfiguration>(
configuration.GetSection("Resilience:Retry"));
services.Configure<CircuitBreakerConfiguration>(
configuration.GetSection("Resilience:CircuitBreaker"));
// Register each wrapped policy under its own key. Two unkeyed AsyncPolicyWrap
// registrations would collide: resolving AsyncPolicyWrap returns only the last one.
// Assumes keyed services (Microsoft.Extensions.DependencyInjection 8.0 or later);
// the "database" and "broker" keys are illustrative names.
services.AddKeyedSingleton<AsyncPolicyWrap>("database", (provider, _) =>
{
var timeoutConfig = provider.GetRequiredService<IOptions<TimeoutConfiguration>>().Value;
var retryConfig = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
var circuitConfig = provider.GetRequiredService<IOptions<CircuitBreakerConfiguration>>().Value;
var logger = provider.GetRequiredService<ILogger<ResiliencePolicyWrapper>>();
return ResiliencePolicyWrapper.CreateCompleteDatabaseResiliencePolicy(
timeoutConfig, retryConfig, circuitConfig, logger);
});
services.AddKeyedSingleton<AsyncPolicyWrap>("broker", (provider, _) =>
{
var timeoutConfig = provider.GetRequiredService<IOptions<TimeoutConfiguration>>().Value;
var retryConfig = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
var circuitConfig = provider.GetRequiredService<IOptions<CircuitBreakerConfiguration>>().Value;
var logger = provider.GetRequiredService<ILogger<ResiliencePolicyWrapper>>();
return ResiliencePolicyWrapper.CreateCompleteBrokerResiliencePolicy(
timeoutConfig, retryConfig, circuitConfig, logger);
});
return services;
}
6. Create Resilience Integration Tests
Create tests/StarGate.IntegrationTests/Resilience/ResilienceIntegrationTests.cs:
namespace StarGate.IntegrationTests.Resilience;
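using FluentAssertions;
using Microsoft.AspNetCore.Mvc.Testing;
using Microsoft.Extensions.DependencyInjection;
using Polly.CircuitBreaker;
using StarGate.Infrastructure.Resilience;
using Xunit;
// WebApplicationFactory is generic over the entry-point type.
// Assumes the server's entry point is a Program class visible to the test project.
public class ResilienceIntegrationTests : IClassFixture<WebApplicationFactory<Program>>
{
private readonly WebApplicationFactory<Program> _factory;
public ResilienceIntegrationTests(WebApplicationFactory<Program> factory)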
{
_factory = factory;
}
[Fact]
public async Task Should_RetryOnTransientFailures()
{
// Test retry policy with simulated transient failures
// Implement test logic
}
[Fact]
public async Task Should_OpenCircuitAfterThreshold()
{
// Test circuit breaker opening after threshold
// Implement test logic
}
[Fact]
public async Task Should_TimeoutSlowOperations()
{
// Test timeout policy on slow operations
// Implement test logic
}
[Fact]
public async Task Should_CombineAllPoliciesCorrectly()
{
// Test interaction of timeout, circuit breaker, and retry
// Implement test logic
}
}
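As a starting point for the stubs above, the following sketch exercises the timeout and retry behavior directly against Polly, without the web host. The class name `PolicyBehaviorSketchTests`, the 100 ms limit, and the 1 ms retry delay are illustrative assumptions chosen to keep the tests fast; they are not the configured StarGate values.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;
using Xunit;

public class PolicyBehaviorSketchTests
{
    [Fact]
    public async Task Timeout_Rejects_SlowOperation()
    {
        var policy = Policy.TimeoutAsync(
            TimeSpan.FromMilliseconds(100), TimeoutStrategy.Optimistic);

        // The operation honors the token, so the optimistic strategy can cancel it.
        await Assert.ThrowsAsync<TimeoutRejectedException>(() =>
            policy.ExecuteAsync(
                ct => Task.Delay(TimeSpan.FromSeconds(5), ct),
                CancellationToken.None));
    }

    [Fact]
    public async Task Retry_Runs_InitialAttempt_Plus_ThreeRetries()
    {
        var attempts = 0;
        var retry = Policy
            .Handle<InvalidOperationException>()
            .WaitAndRetryAsync(3, _ => TimeSpan.FromMilliseconds(1));

        // Every attempt fails, so the final exception propagates to the caller.
        await Assert.ThrowsAsync<InvalidOperationException>(() =>
            retry.ExecuteAsync(() =>
            {
                attempts++;
                return Task.FromException(
                    new InvalidOperationException("simulated transient failure"));
            }));

        Assert.Equal(4, attempts); // 1 initial attempt + 3 retries
    }
}
```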
Create tests/StarGate.IntegrationTests/Resilience/ChaosTests.cs:
namespace StarGate.IntegrationTests.Resilience;
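using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;
// WebApplicationFactory is generic over the entry-point type.
// Assumes the server's entry point is a Program class visible to the test project.
/// <summary>
/// Chaos testing scenarios for resilience validation.
/// </summary>
public class ChaosTests : IClassFixture<WebApplicationFactory<Program>>
{
private readonly WebApplicationFactory<Program> _factory;
public ChaosTests(WebApplicationFactory<Program> factory)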
{
_factory = factory;
}
[Fact]
public async Task ChaosScenario_DatabaseIntermittentFailures()
{
// Simulate random database failures (30% failure rate)
// Verify retry handles intermittent failures
// Measure success rate and latency
}
[Fact]
public async Task ChaosScenario_DatabaseProlongedOutage()
{
// Simulate complete database unavailability
// Verify circuit breaker opens
// Verify fail-fast behavior
// Verify recovery when database restored
}
[Fact]
public async Task ChaosScenario_BrokerSlowResponses()
{
// Simulate slow broker responses (>timeout)
// Verify timeout policy activates
// Measure performance impact
}
[Fact]
public async Task ChaosScenario_NetworkPartition()
{
// Simulate network issues (timeouts, connection errors)
// Verify all policies work together
// Measure system degradation
}
[Fact]
public async Task ChaosScenario_HighLoad()
{
// Simulate high load with varying failure rates
// Verify circuit breaker protects system
// Measure throughput with/without policies
}
}
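The "30% failure rate" style scenarios need some way to inject failures. One possible approach is a small seeded helper, sketched below; `FaultInjector` is a hypothetical test utility, not an existing StarGate or Polly type, and the injected `TimeoutException` is an arbitrary choice of failure. Simmy or Toxiproxy could serve the same purpose at the policy or network level.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical test helper: fails a configurable fraction of calls.
public sealed class FaultInjector
{
    private readonly Random _random;
    private readonly double _failureRate;
    private readonly object _gate = new object();

    public FaultInjector(double failureRate, int seed)
    {
        if (failureRate < 0.0 || failureRate > 1.0)
            throw new ArgumentOutOfRangeException(nameof(failureRate));

        _failureRate = failureRate;
        _random = new Random(seed); // a fixed seed keeps a test run reproducible
    }

    // Runs the operation, or returns a faulted task in its place.
    public Task ExecuteAsync(Func<Task> operation)
    {
        bool fail;
        lock (_gate) // System.Random is not thread-safe
        {
            fail = _random.NextDouble() < _failureRate;
        }

        return fail
            ? Task.FromException(new TimeoutException("Injected failure"))
            : operation();
    }
}
```

A chaos test could wrap the real call, for example `injector.ExecuteAsync(() => repository.SaveAsync(item))` where `repository` stands for whatever dependency is under test, and then assert on the observed success rate.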
7. Create Performance Tests
Create tests/StarGate.PerformanceTests/ResiliencePolicyOverheadTests.cs:
namespace StarGate.PerformanceTests;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
[MemoryDiagnoser]
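// Recent BenchmarkDotNet releases name this parameter iterationCount; older ones used targetCount.
[SimpleJob(warmupCount: 3, iterationCount: 10)]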
public class ResiliencePolicyOverheadTests
{
[Benchmark(Baseline = true)]
public async Task Operation_WithoutPolicies()
{
// Measure baseline performance
await Task.Delay(10);
}
[Benchmark]
public async Task Operation_WithRetryPolicy()
{
// Measure overhead with retry policy
}
[Benchmark]
public async Task Operation_WithCircuitBreaker()
{
// Measure overhead with circuit breaker
}
[Benchmark]
public async Task Operation_WithTimeout()
{
// Measure overhead with timeout
}
[Benchmark]
public async Task Operation_WithAllPolicies()
{
// Measure overhead with complete policy stack
}
}
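A sketch of how one of the empty benchmarks might be filled in, together with the entry point that `dotnet run -c Release` needs. The class name `TimeoutOverheadSketch` and the 10-second limit are assumptions for illustration. The workload is deliberately `Task.CompletedTask` rather than `Task.Delay(10)`: a 10 ms delay is subject to timer resolution and would likely hide any sub-millisecond policy overhead.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Polly;
using Polly.Timeout;

[MemoryDiagnoser]
public class TimeoutOverheadSketch
{
    // A generous limit so the timeout never fires; only policy overhead is measured.
    private readonly AsyncTimeoutPolicy _timeout =
        Policy.TimeoutAsync(TimeSpan.FromSeconds(10), TimeoutStrategy.Optimistic);

    [Benchmark(Baseline = true)]
    public Task Baseline() => Task.CompletedTask;

    [Benchmark]
    public Task WithTimeout() =>
        _timeout.ExecuteAsync(_ => Task.CompletedTask, CancellationToken.None);
}

public static class Program
{
    // Entry point: lets BenchmarkDotNet discover and run the benchmarks in this assembly.
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}
```

The retry, circuit breaker, and wrapped-policy benchmarks could follow the same shape, each holding its policy in a field so that construction cost is not measured.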
8. Create Resilience Documentation
Create docs/RESILIENCE-STRATEGY.md:
# Resilience Strategy
## Overview
StarGate implements comprehensive resilience patterns using Polly to handle failures gracefully.
## Policies Implemented
### 1. Retry Policy
- **Purpose:** Handle transient failures
- **Strategy:** Exponential backoff with jitter
- **Max Attempts:** 3
- **Delays:** 1s, 2s, 4s (+/- 10% jitter)
### 2. Circuit Breaker
- **Purpose:** Prevent cascading failures
- **Threshold:** 50% failure rate
- **Minimum Throughput:** 10 requests
- **Break Duration:** 30 seconds
### 3. Timeout
- **Purpose:** Prevent indefinite waiting
- **HTTP:** 30 seconds
- **Database:** 10 seconds
- **Broker:** 5 seconds
## Policy Combination
Timeout (outer)
↓
Circuit Breaker
↓
Retry (inner)
↓
Operation
## Configuration
See `appsettings.json` for configuration options.
## Monitoring
Check health endpoint: `/health`
## Testing
See `tests/StarGate.IntegrationTests/Resilience/` for test scenarios.
✅ Acceptance Criteria
📝 Testing Instructions
# Run unit tests
dotnet test tests/StarGate.Infrastructure.Tests --filter "FullyQualifiedName~Timeout"
# Run integration tests
dotnet test tests/StarGate.IntegrationTests --filter "FullyQualifiedName~Resilience"
# Run chaos tests
dotnet test tests/StarGate.IntegrationTests --filter "FullyQualifiedName~Chaos"
# Run performance tests
cd tests/StarGate.PerformanceTests
dotnet run -c Release
# Test timeout with slow database
# 1. Add artificial delay in MongoDB query
# 2. Verify timeout triggers
# 3. Check logs: "Database operation timed out: Timeout=10s"
# Test complete policy stack
# 1. Stop MongoDB
# 2. Create process (should retry)
# 3. After 10+ failures, circuit opens
# 4. Requests fail immediately with BrokenCircuitException (no retries, no timeout wait)
# 5. Wait 30s for half-open
# 6. Restart MongoDB
# 7. Circuit closes on success
# Measure policy overhead
# Compare the benchmark results against these estimates (not yet measured):
# - Without policies: baseline
# - With retry only: +0.5ms
# - With circuit breaker: +0.3ms
# - With timeout: +0.2ms
# - With all: +1.0ms
# Verify health endpoint
curl http://localhost:5000/health | jq
# Should show:
# - All policies configured
# - Circuit states
# - Recent failures/successes
📚 References
🏷️ Labels
phase-8 resilience sprint-8.1 polly timeout testing
⏱️ Estimated Effort
10-12 hours
🔗 Dependencies
🔗 Related Issues
Part of Phase 8: Resilience - Sprint 8.1: Polly Integration
📌 Important Notes
Policy Wrapping Order
Timeout (outer) → Must be outermost
↓ Total operation timeout
Circuit Breaker → Protects from cascading failures
↓ Fails fast when open
Retry (inner) → Handles transient failures
↓ Multiple attempts
Operation → Actual work
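Why this order?
- Timeout outermost: Bounds the total time of the whole call, including every retry and backoff delay
- Circuit breaker middle: When open, rejects the call before any retry is attempted
- Retry innermost: Handles transient failures; the circuit breaker sees only the final outcome of each retry sequence, not the individual attempts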
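The interaction between the breaker and the inner retry can be checked with a small experiment. This sketch assumes Polly v7 and, for brevity, uses a consecutive-count circuit breaker that opens after 2 failures instead of the failure-rate breaker configured above; `WrapOrderDemo` and the short delays are illustrative only.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public static class WrapOrderDemo
{
    // Runs `calls` always-failing executions through timeout -> breaker -> retry
    // and reports how often the operation ran and the final circuit state.
    public static async Task<(int Attempts, CircuitState State)> RunAsync(int calls)
    {
        var attempts = 0;

        var retry = Policy
            .Handle<InvalidOperationException>()
            .WaitAndRetryAsync(3, _ => TimeSpan.FromMilliseconds(1));
        var breaker = Policy
            .Handle<InvalidOperationException>()
            .CircuitBreakerAsync(2, TimeSpan.FromSeconds(30));
        var timeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(10));

        // First argument is the outermost policy, last is the innermost.
        var wrap = Policy.WrapAsync(timeout, breaker, retry);

        for (var i = 0; i < calls; i++)
        {
            try
            {
                await wrap.ExecuteAsync(() =>
                {
                    attempts++;
                    return Task.FromException(
                        new InvalidOperationException("simulated outage"));
                });
            }
            catch (InvalidOperationException) { /* retries exhausted */ }
            catch (BrokenCircuitException) { /* rejected by the open circuit */ }
        }

        return (attempts, breaker.CircuitState);
    }
}
```

One call produces 4 attempts and leaves the circuit closed, because the breaker counts a single failure for the whole retry sequence. The second call opens the circuit, and a third call is rejected without running the operation at all.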
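Pessimistic vs Optimistic Timeout
Pessimistic (Default in this configuration):
TimeoutStrategy.Pessimistic
- Enforces the timeout even if the operation ignores cancellation
- The caller stops waiting and receives a TimeoutRejectedException
- The abandoned operation may keep running in the background
- Use when the operation cannot be cancelled
Optimistic:
TimeoutStrategy.Optimistic
- Cancels the operation cooperatively via CancellationToken
- Requires the operation to accept and honor the token
- Nothing is left running once the operation observes the cancellation
- Polly's own default strategy; prefer it when the operation supports cancellation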
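The difference between the two strategies is easiest to see side by side. In Polly v7, the optimistic strategy cancels cooperatively through the token it passes to the delegate, while the pessimistic strategy stops waiting and leaves the delegate's work running. The sketch below assumes Polly v7; `TimeoutStrategyDemo` and the 200 ms and 2 s durations are illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class TimeoutStrategyDemo
{
    public static async Task Main()
    {
        var limit = TimeSpan.FromMilliseconds(200);

        // Optimistic: the delegate receives a token that Polly cancels at the limit.
        var optimistic = Policy.TimeoutAsync(limit, TimeoutStrategy.Optimistic);
        try
        {
            await optimistic.ExecuteAsync(
                ct => Task.Delay(TimeSpan.FromSeconds(2), ct),
                CancellationToken.None);
        }
        catch (TimeoutRejectedException)
        {
            Console.WriteLine("optimistic: operation cancelled cooperatively");
        }

        // Pessimistic: the delegate ignores the token, so Polly releases the
        // caller at the limit while the underlying work keeps running.
        var pessimistic = Policy.TimeoutAsync(limit, TimeoutStrategy.Pessimistic);
        Task abandoned = Task.CompletedTask;
        try
        {
            await pessimistic.ExecuteAsync(
                _ => abandoned = Task.Delay(TimeSpan.FromSeconds(2)),
                CancellationToken.None);
        }
        catch (TimeoutRejectedException)
        {
            Console.WriteLine(
                $"pessimistic: caller released, work finished = {abandoned.IsCompleted}");
        }
    }
}
```

Because abandoned work continues to hold its connection or thread, the pessimistic strategy bounds the caller's wait but not the resource usage of the timed-out operation.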
Timeout Values Rationale
HTTP: 30 seconds
- External API calls
- May include network latency
- Allows for slow responses
Database: 10 seconds
- Local network (low latency)
- Query should be fast
- Longer suggests query issue
Broker: 5 seconds
- Local network
- Should be very fast
- Longer suggests connection issue
Testing Strategy
Unit Tests:
- Test policy configuration
- Test timeout calculation
- Mock slow operations
Integration Tests:
- Test with real infrastructure
- Simulate slow responses
- Verify timeout triggers
Chaos Tests:
- Random failures
- Prolonged outages
- Network issues
- High load
Performance Tests:
- Measure overhead
- Compare scenarios
- Identify bottlenecks
Performance Overhead
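Expected Overhead (Success Case, estimates to confirm with the benchmarks):
- Retry: ~0.5ms (tracking state)
- Circuit Breaker: ~0.3ms (state check)
- Timeout: ~0.2ms (timer setup)
- Total: ~1ms (acceptable)
Failure Case:
- Retry: +7s of backoff when all retries are used (1s + 2s + 4s delays, before jitter)
- Circuit Breaker: Fail immediately when open
- Timeout: Fail at timeout threshold. Because the timeout is outermost, it bounds the whole retry sequence: the 5-second broker timeout expires during the third backoff delay (1s + 2s + 4s = 7s), so the last retry never runs unless the broker retry delays are shortened or the timeout is raised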
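The backoff arithmetic can be reproduced from the retry settings. This sketch assumes the delay before retry n (counting from 0) is the initial delay times the multiplier to the power n, capped at the maximum delay, and it leaves jitter out; `BackoffSchedule` is a hypothetical helper, and the actual `RetryPolicyFactory` may compute delays differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BackoffSchedule
{
    // Delay before retry n (0-based): initial * multiplier^n, capped at maxDelay.
    public static IReadOnlyList<TimeSpan> Delays(
        int maxRetryAttempts,
        double initialDelaySeconds,
        double backoffMultiplier,
        double maxDelaySeconds)
    {
        return Enumerable.Range(0, maxRetryAttempts)
            .Select(n => Math.Min(
                initialDelaySeconds * Math.Pow(backoffMultiplier, n),
                maxDelaySeconds))
            .Select(seconds => TimeSpan.FromSeconds(seconds))
            .ToList();
    }

    // Total time spent waiting between attempts when every retry is used.
    public static TimeSpan TotalDelay(IEnumerable<TimeSpan> delays) =>
        delays.Aggregate(TimeSpan.Zero, (sum, delay) => sum + delay);
}
```

With the values from `appsettings.json`, `BackoffSchedule.Delays(3, 1.0, 2.0, 30.0)` yields 1s, 2s, and 4s, for a total of 7 seconds. Comparing that total with each outer timeout shows how much of the budget is left for the attempts themselves.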
Monitoring Recommendations
Key Metrics:
- Policy execution count
- Success/failure rate
- Retry attempts
- Circuit state changes
- Timeout occurrences
- Average latency
Dashboards:
- Policy health overview
- Circuit breaker states
- Timeout trends
- Performance impact
Alerts:
- High timeout rate
- Circuit opened
- High retry rate
- Degraded performance
Production Considerations
Configuration Tuning:
- Start conservative
- Monitor metrics
- Adjust based on behavior
- Different values per environment
Gradual Rollout:
- Enable retry only
- Add circuit breaker
- Add timeout
- Monitor each step
Fallback Strategies:
- Cached responses
- Default values
- Degraded mode
- Clear error messages
Chaos Engineering
Regularly test:
- Random infrastructure failures
- Network latency injection
- Slow response simulation
- High load scenarios
- Resource exhaustion
Tools:
- Toxiproxy (network chaos)
- Simmy (chaos policies)
- Custom failure injection
Documentation Requirements
For Developers:
- Policy configuration guide
- Testing guidelines
- Troubleshooting tips
- Best practices
For Operations:
- Monitoring setup
- Alert configuration
- Incident response
- Performance tuning