📋 Task Description
Implement comprehensive retry policies using Polly to handle transient failures in external service calls (HTTP clients, database operations, message broker). Configure exponential backoff, jitter, and appropriate retry limits for different failure scenarios.
🎯 Objectives
- Install and configure Polly library
- Implement retry policies for HTTP clients
- Implement retry policies for database operations
- Implement retry policies for message broker
- Configure exponential backoff with jitter
- Define retry limits per service type
- Integrate policies with DI container
- Add comprehensive logging for retry attempts
- Write unit tests for retry behavior
- Write integration tests with simulated failures
- Document retry policy configuration
- Measure and monitor retry metrics
📦 Deliverables
1. Install Polly Package
Update src/StarGate.Infrastructure/StarGate.Infrastructure.csproj:
<ItemGroup>
<PackageReference Include="Polly" Version="8.3.1" />
<PackageReference Include="Polly.Extensions.Http" Version="3.0.0" />
<PackageReference Include="Microsoft.Extensions.Http.Polly" Version="8.0.0" />
</ItemGroup>
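Equivalently, the packages can be added from the .NET CLI (same versions as the csproj entries above):
dotnet add src/StarGate.Infrastructure/StarGate.Infrastructure.csproj package Polly --version 8.3.1
dotnet add src/StarGate.Infrastructure/StarGate.Infrastructure.csproj package Polly.Extensions.Http --version 3.0.0
dotnet add src/StarGate.Infrastructure/StarGate.Infrastructure.csproj package Microsoft.Extensions.Http.Polly --version 8.0.0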
2. Create Retry Policy Configuration
Create src/StarGate.Infrastructure/Resilience/RetryPolicyConfiguration.cs:
namespace StarGate.Infrastructure.Resilience;
/// <summary>
/// Configuration for retry policies.
/// </summary>
public class RetryPolicyConfiguration
{
/// <summary>
/// Maximum number of retry attempts.
/// </summary>
public int MaxRetryAttempts { get; set; } = 3;
/// <summary>
/// Initial delay before first retry (seconds).
/// </summary>
public double InitialDelaySeconds { get; set; } = 1.0;
/// <summary>
/// Maximum delay between retries (seconds).
/// </summary>
public double MaxDelaySeconds { get; set; } = 30.0;
/// <summary>
/// Exponential backoff multiplier.
/// </summary>
public double BackoffMultiplier { get; set; } = 2.0;
/// <summary>
/// Whether to use jitter to prevent thundering herd.
/// </summary>
public bool UseJitter { get; set; } = true;
/// <summary>
/// Calculates delay for a specific retry attempt.
/// </summary>
public TimeSpan CalculateDelay(int retryAttempt)
{
var exponentialDelay = InitialDelaySeconds * Math.Pow(BackoffMultiplier, retryAttempt - 1);
var delay = Math.Min(exponentialDelay, MaxDelaySeconds);
if (UseJitter)
{
var jitter = delay * 0.2 * (Random.Shared.NextDouble() - 0.5); // +/- 10% of the delay, using the thread-safe shared Random
delay += jitter;
}
return TimeSpan.FromSeconds(Math.Max(delay, 0));
}
}
3. Create Retry Policy Factory
Create src/StarGate.Infrastructure/Resilience/RetryPolicyFactory.cs:
using Microsoft.Extensions.Logging;
using Polly;
using Polly.Retry;
namespace StarGate.Infrastructure.Resilience;
/// <summary>
/// Factory for creating Polly retry policies.
/// </summary>
public static class RetryPolicyFactory
{
/// <summary>
/// Creates a retry policy for HTTP operations.
/// </summary>
public static AsyncRetryPolicy<HttpResponseMessage> CreateHttpRetryPolicy(
RetryPolicyConfiguration config,
ILogger logger)
{
return Policy
// Retry only transient status codes (5xx, 408 Request Timeout, 429 Too Many Requests);
// other 4xx client errors are not retried.
.HandleResult<HttpResponseMessage>(r =>
(int)r.StatusCode >= 500
|| r.StatusCode == System.Net.HttpStatusCode.RequestTimeout
|| r.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
.Or<HttpRequestException>()
.Or<TimeoutException>()
.WaitAndRetryAsync(
retryCount: config.MaxRetryAttempts,
sleepDurationProvider: retryAttempt => config.CalculateDelay(retryAttempt),
onRetry: (outcome, timespan, retryAttempt, context) =>
{
var statusCode = outcome.Result?.StatusCode.ToString() ?? "N/A";
var exception = outcome.Exception?.GetType().Name ?? "None";
logger.LogWarning(
"HTTP retry attempt {RetryAttempt}/{MaxRetries}: StatusCode={StatusCode}, Exception={Exception}, Delay={Delay}ms",
retryAttempt,
config.MaxRetryAttempts,
statusCode,
exception,
timespan.TotalMilliseconds);
});
}
/// <summary>
/// Creates a retry policy for database operations.
/// </summary>
public static AsyncRetryPolicy CreateDatabaseRetryPolicy(
RetryPolicyConfiguration config,
ILogger logger)
{
return Policy
.Handle<TimeoutException>()
.Or<IOException>()
.Or<InvalidOperationException>(ex => ex.Message.Contains("connection"))
.WaitAndRetryAsync(
retryCount: config.MaxRetryAttempts,
sleepDurationProvider: retryAttempt => config.CalculateDelay(retryAttempt),
onRetry: (exception, timespan, retryAttempt, context) =>
{
logger.LogWarning(
exception,
"Database retry attempt {RetryAttempt}/{MaxRetries}: Exception={Exception}, Delay={Delay}ms",
retryAttempt,
config.MaxRetryAttempts,
exception.GetType().Name,
timespan.TotalMilliseconds);
});
}
/// <summary>
/// Creates a retry policy for message broker operations.
/// </summary>
public static AsyncRetryPolicy CreateBrokerRetryPolicy(
RetryPolicyConfiguration config,
ILogger logger)
{
return Policy
.Handle<IOException>()
.Or<TimeoutException>()
.Or<InvalidOperationException>(ex => ex.Message.Contains("connection"))
.WaitAndRetryAsync(
retryCount: config.MaxRetryAttempts,
sleepDurationProvider: retryAttempt => config.CalculateDelay(retryAttempt),
onRetry: (exception, timespan, retryAttempt, context) =>
{
logger.LogWarning(
exception,
"Broker retry attempt {RetryAttempt}/{MaxRetries}: Exception={Exception}, Delay={Delay}ms",
retryAttempt,
config.MaxRetryAttempts,
exception.GetType().Name,
timespan.TotalMilliseconds);
});
}
/// <summary>
/// Creates a generic retry policy for any async operation.
/// </summary>
public static AsyncRetryPolicy CreateGenericRetryPolicy(
RetryPolicyConfiguration config,
ILogger logger)
{
return Policy
.Handle<Exception>(ex => IsTransientException(ex))
.WaitAndRetryAsync(
retryCount: config.MaxRetryAttempts,
sleepDurationProvider: retryAttempt => config.CalculateDelay(retryAttempt),
onRetry: (exception, timespan, retryAttempt, context) =>
{
logger.LogWarning(
exception,
"Generic retry attempt {RetryAttempt}/{MaxRetries}: Exception={Exception}, Delay={Delay}ms",
retryAttempt,
config.MaxRetryAttempts,
exception.GetType().Name,
timespan.TotalMilliseconds);
});
}
private static bool IsTransientException(Exception ex)
{
return ex is TimeoutException
|| ex is HttpRequestException
|| ex is IOException
|| (ex is InvalidOperationException && ex.Message.Contains("connection"));
}
}
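For reference, a minimal usage sketch of the generic policy around an arbitrary async call. Here `logger`, `cancellationToken`, and `CallExternalServiceAsync` are placeholders for illustration, not members of the existing codebase:
var config = new RetryPolicyConfiguration { MaxRetryAttempts = 3 };
var policy = RetryPolicyFactory.CreateGenericRetryPolicy(config, logger); // logger: any ILogger instance
// Transient exceptions (timeouts, IO errors, connection-related failures) are retried
// with exponential backoff; anything else propagates immediately.
var result = await policy.ExecuteAsync(() => CallExternalServiceAsync(cancellationToken));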
4. Configure Policies in DI Container
Create src/StarGate.Infrastructure/Extensions/ResilienceServiceCollectionExtensions.cs:
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Polly;
using Polly.Retry;
using StarGate.Infrastructure.Resilience;
namespace StarGate.Infrastructure.Extensions;
/// <summary>
/// Extension methods for registering resilience policies.
/// </summary>
public static class ResilienceServiceCollectionExtensions
{
/// <summary>
/// Adds resilience policies to the service collection.
/// </summary>
public static IServiceCollection AddResiliencePolicies(
this IServiceCollection services,
IConfiguration configuration)
{
// Register retry policy configuration
services.Configure<RetryPolicyConfiguration>(
configuration.GetSection("Resilience:Retry"));
// Register retry policies as keyed singletons so the database and broker
// policies (both of type AsyncRetryPolicy) can be resolved independently
services.AddKeyedSingleton<AsyncRetryPolicy>("database-retry", (provider, _) =>
{
var config = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
var logger = provider.GetRequiredService<ILoggerFactory>().CreateLogger(nameof(RetryPolicyFactory));
return RetryPolicyFactory.CreateDatabaseRetryPolicy(config, logger);
});
services.AddKeyedSingleton<AsyncRetryPolicy>("broker-retry", (provider, _) =>
{
var config = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
var logger = provider.GetRequiredService<ILoggerFactory>().CreateLogger(nameof(RetryPolicyFactory));
return RetryPolicyFactory.CreateBrokerRetryPolicy(config, logger);
});
return services;
}
/// <summary>
/// Adds HTTP client with retry policy.
/// </summary>
public static IHttpClientBuilder AddHttpClientWithRetry<TClient>(
this IServiceCollection services,
string name)
where TClient : class
{
return services
.AddHttpClient<TClient>(name)
.AddPolicyHandler((provider, request) =>
{
var config = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
var logger = provider.GetRequiredService<ILogger<TClient>>();
return RetryPolicyFactory.CreateHttpRetryPolicy(config, logger);
});
}
}
5. Update Configuration
Update src/StarGate.Server/appsettings.json:
{
"Resilience": {
"Retry": {
"MaxRetryAttempts": 3,
"InitialDelaySeconds": 1.0,
"MaxDelaySeconds": 30.0,
"BackoffMultiplier": 2.0,
"UseJitter": true
}
}
}
Update src/StarGate.Server/appsettings.Development.json:
{
"Resilience": {
"Retry": {
"MaxRetryAttempts": 2,
"InitialDelaySeconds": 0.5,
"MaxDelaySeconds": 10.0,
"BackoffMultiplier": 2.0,
"UseJitter": true
}
}
}
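The same values can be overridden per environment using the standard ASP.NET Core double-underscore environment-variable convention; the values below are illustrative only:
export Resilience__Retry__MaxRetryAttempts=5
export Resilience__Retry__InitialDelaySeconds=2.0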
6. Apply Retry Policies to Repositories
Update src/StarGate.Infrastructure/Repositories/MongoProcessRepository.cs:
private readonly AsyncRetryPolicy _retryPolicy;
public MongoProcessRepository(
IMongoDatabase database,
[FromKeyedServices("database-retry")] AsyncRetryPolicy retryPolicy,
ILogger<MongoProcessRepository> logger)
{
_database = database ?? throw new ArgumentNullException(nameof(database));
_retryPolicy = retryPolicy ?? throw new ArgumentNullException(nameof(retryPolicy));
_logger = logger ?? throw new ArgumentNullException(nameof(logger));
_collection = _database.GetCollection<ProcessDocument>("processes");
}
public async Task CreateAsync(Process process, CancellationToken cancellationToken = default)
{
await _retryPolicy.ExecuteAsync(async () =>
{
var document = MapToDocument(process);
await _collection.InsertOneAsync(document, cancellationToken: cancellationToken);
_logger.LogDebug("Process created: ProcessId={ProcessId}", process.ProcessId);
});
}
public async Task<Process?> GetByIdAsync(Guid processId, CancellationToken cancellationToken = default)
{
return await _retryPolicy.ExecuteAsync(async () =>
{
var filter = Builders<ProcessDocument>.Filter.Eq(d => d.Id, processId.ToString());
var document = await _collection.Find(filter).FirstOrDefaultAsync(cancellationToken);
return document != null ? MapToDomain(document) : null;
});
}
7. Apply Retry Policies to Message Broker
Update src/StarGate.Infrastructure/Messaging/RabbitMqBroker.cs:
private readonly AsyncRetryPolicy _retryPolicy;
public RabbitMqBroker(
IConnection connection,
[FromKeyedServices("broker-retry")] AsyncRetryPolicy retryPolicy,
ILogger<RabbitMqBroker> logger)
{
_connection = connection ?? throw new ArgumentNullException(nameof(connection));
_retryPolicy = retryPolicy ?? throw new ArgumentNullException(nameof(retryPolicy));
_logger = logger ?? throw new ArgumentNullException(nameof(logger));
}
public async Task PublishAsync<T>(
T message,
string routingKey,
CancellationToken cancellationToken = default) where T : class
{
await _retryPolicy.ExecuteAsync(async () =>
{
using var channel = _connection.CreateModel();
var messageBody = SerializeMessage(message);
var properties = channel.CreateBasicProperties();
properties.Persistent = true;
properties.ContentType = "application/json";
properties.MessageId = Guid.NewGuid().ToString();
channel.BasicPublish(
exchange: "stargate.processes",
routingKey: routingKey,
basicProperties: properties,
body: messageBody);
_logger.LogDebug("Message published: RoutingKey={RoutingKey}, MessageId={MessageId}",
routingKey, properties.MessageId);
await Task.CompletedTask;
});
}
8. Register Policies in Program.cs
Update src/StarGate.Server/Program.cs:
// Add resilience policies
builder.Services.AddResiliencePolicies(builder.Configuration);
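If the application also registers typed HTTP clients, they can be wired through the helper from deliverable 4. The client type ExternalServiceClient and the client name are placeholders, not existing classes:
// Typed HTTP client whose outgoing calls run through the HTTP retry policy
builder.Services.AddHttpClientWithRetry<ExternalServiceClient>("external-service");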
9. Create Unit Tests
Create tests/StarGate.Infrastructure.Tests/Resilience/RetryPolicyConfigurationTests.cs:
using FluentAssertions;
using StarGate.Infrastructure.Resilience;
using Xunit;
namespace StarGate.Infrastructure.Tests.Resilience;
public class RetryPolicyConfigurationTests
{
[Theory]
[InlineData(1, 1.0)] // First retry: 1 second
[InlineData(2, 2.0)] // Second retry: 2 seconds
[InlineData(3, 4.0)] // Third retry: 4 seconds
[InlineData(4, 8.0)] // Fourth retry: 8 seconds
public void CalculateDelay_Should_UseExponentialBackoff(int retryAttempt, double expectedSeconds)
{
// Arrange
var config = new RetryPolicyConfiguration
{
InitialDelaySeconds = 1.0,
BackoffMultiplier = 2.0,
MaxDelaySeconds = 30.0,
UseJitter = false
};
// Act
var delay = config.CalculateDelay(retryAttempt);
// Assert
delay.TotalSeconds.Should().Be(expectedSeconds);
}
[Fact]
public void CalculateDelay_Should_RespectMaxDelay()
{
// Arrange
var config = new RetryPolicyConfiguration
{
InitialDelaySeconds = 1.0,
BackoffMultiplier = 2.0,
MaxDelaySeconds = 5.0,
UseJitter = false
};
// Act
var delay = config.CalculateDelay(10); // Would be 512 seconds without cap
// Assert
delay.TotalSeconds.Should().Be(5.0);
}
[Fact]
public void CalculateDelay_Should_AddJitter_WhenEnabled()
{
// Arrange
var config = new RetryPolicyConfiguration
{
InitialDelaySeconds = 10.0,
BackoffMultiplier = 2.0,
UseJitter = true
};
// Act
var delays = Enumerable.Range(0, 20)
.Select(_ => config.CalculateDelay(1).TotalSeconds)
.ToList();
// Assert - delays should vary due to jitter
delays.Should().OnlyHaveUniqueItems();
delays.Should().AllSatisfy(d => d.Should().BeInRange(9.0, 11.0)); // 10 +/- 10%
}
[Fact]
public void CalculateDelay_Should_NotReturnNegativeDelay()
{
// Arrange
var config = new RetryPolicyConfiguration
{
InitialDelaySeconds = 0.1,
UseJitter = true
};
// Act
var delays = Enumerable.Range(0, 100)
.Select(_ => config.CalculateDelay(1))
.ToList();
// Assert
delays.Should().AllSatisfy(d => d.Should().BeGreaterOrEqualTo(TimeSpan.Zero));
}
}
Create tests/StarGate.Infrastructure.Tests/Resilience/RetryPolicyFactoryTests.cs:
using FluentAssertions;
using Microsoft.Extensions.Logging.Abstractions;
using Polly;
using StarGate.Infrastructure.Resilience;
using Xunit;
namespace StarGate.Infrastructure.Tests.Resilience;
public class RetryPolicyFactoryTests
{
private readonly RetryPolicyConfiguration _config;
private readonly NullLogger _logger;
public RetryPolicyFactoryTests()
{
_config = new RetryPolicyConfiguration
{
MaxRetryAttempts = 3,
InitialDelaySeconds = 0.1,
UseJitter = false
};
_logger = NullLogger.Instance;
}
[Fact]
public async Task HttpRetryPolicy_Should_RetryOnHttpRequestException()
{
// Arrange
var policy = RetryPolicyFactory.CreateHttpRetryPolicy(_config, _logger);
var attemptCount = 0;
// Act
var act = async () => await policy.ExecuteAsync(async () =>
{
attemptCount++;
await Task.CompletedTask;
throw new HttpRequestException("Simulated failure");
});
// Assert
await act.Should().ThrowAsync<HttpRequestException>();
attemptCount.Should().Be(4); // Initial + 3 retries
}
[Fact]
public async Task DatabaseRetryPolicy_Should_RetryOnTimeoutException()
{
// Arrange
var policy = RetryPolicyFactory.CreateDatabaseRetryPolicy(_config, _logger);
var attemptCount = 0;
// Act
var act = async () => await policy.ExecuteAsync(async () =>
{
attemptCount++;
await Task.CompletedTask;
throw new TimeoutException("Simulated timeout");
});
// Assert
await act.Should().ThrowAsync<TimeoutException>();
attemptCount.Should().Be(4); // Initial + 3 retries
}
[Fact]
public async Task BrokerRetryPolicy_Should_RetryOnIOException()
{
// Arrange
var policy = RetryPolicyFactory.CreateBrokerRetryPolicy(_config, _logger);
var attemptCount = 0;
// Act
var act = async () => await policy.ExecuteAsync(async () =>
{
attemptCount++;
await Task.CompletedTask;
throw new IOException("Simulated IO error");
});
// Assert
await act.Should().ThrowAsync<IOException>();
attemptCount.Should().Be(4); // Initial + 3 retries
}
[Fact]
public async Task RetryPolicy_Should_SucceedOnEventualSuccess()
{
// Arrange
var policy = RetryPolicyFactory.CreateGenericRetryPolicy(_config, _logger);
var attemptCount = 0;
// Act
await policy.ExecuteAsync(async () =>
{
attemptCount++;
if (attemptCount < 3)
{
throw new TimeoutException("Transient failure");
}
await Task.CompletedTask;
});
// Assert
attemptCount.Should().Be(3); // 2 failures + 1 success
}
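// Sketch of an additional test (assumes the HTTP policy treats 5xx responses as
// retryable, per the status-code strategy in the notes): a persistent 503 should be
// retried MaxRetryAttempts times and the final response returned rather than thrown.
[Fact]
public async Task HttpRetryPolicy_Should_RetryOnServiceUnavailable()
{
// Arrange
var policy = RetryPolicyFactory.CreateHttpRetryPolicy(_config, _logger);
var attemptCount = 0;
// Act
var response = await policy.ExecuteAsync(async () =>
{
attemptCount++;
await Task.CompletedTask;
return new HttpResponseMessage(System.Net.HttpStatusCode.ServiceUnavailable);
});
// Assert
response.StatusCode.Should().Be(System.Net.HttpStatusCode.ServiceUnavailable);
attemptCount.Should().Be(4); // Initial + 3 retries
}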
}
✅ Acceptance Criteria
📝 Testing Instructions
# Run unit tests
dotnet test tests/StarGate.Infrastructure.Tests --filter "FullyQualifiedName~Retry"
# Test retry with MongoDB
# 1. Start MongoDB
docker-compose up -d mongodb
# 2. Stop MongoDB to simulate failure
docker-compose stop mongodb
# 3. Try to create process (should retry and eventually fail)
POST /api/processes
# 4. Check logs for retry attempts:
# "Database retry attempt 1/3: Exception=TimeoutException, Delay=1000ms"
# "Database retry attempt 2/3: Exception=TimeoutException, Delay=2000ms"
# "Database retry attempt 3/3: Exception=TimeoutException, Delay=4000ms"
# 5. Restart MongoDB
docker-compose start mongodb
# 6. Verify requests succeed
# Test retry with RabbitMQ
# 1. Stop RabbitMQ during process creation
docker-compose stop rabbitmq
# 2. Create process (should retry)
POST /api/processes
# 3. Check logs for broker retry attempts
# Test exponential backoff
# Verify in logs that delays increase: 1s -> 2s -> 4s
# Test jitter
# Create multiple processes simultaneously
# Verify retry delays vary slightly (not all exactly 1s, 2s, 4s)
# Performance test
# Measure overhead with retry policies enabled
# Should add <10ms per operation in success case
📚 References
🏷️ Labels
phase-8 resilience sprint-8.1 polly retry-policies
⏱️ Estimated Effort
8-10 hours
🔗 Dependencies
🔗 Related Issues
Part of Phase 8: Resilience - Sprint 8.1: Polly Integration
📌 Important Notes
Exponential Backoff Formula
Delay = InitialDelay * (Multiplier ^ (RetryAttempt - 1))
Delay = min(Delay, MaxDelay)
With Jitter:
Jitter = Delay * 0.2 * (Random - 0.5), where Random is uniform in [0, 1), so the jitter is +/- 10% of Delay
FinalDelay = Delay + Jitter
Example (Initial=1s, Multiplier=2):
Retry 1: 1s (+/- 10% jitter)
Retry 2: 2s (+/- 10% jitter)
Retry 3: 4s (+/- 10% jitter)
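A quick sketch that evaluates the formula with the production defaults (Initial=1s, Multiplier=2, Max=30s) and jitter disabled, showing where the cap kicks in:
var config = new RetryPolicyConfiguration { UseJitter = false };
for (var attempt = 1; attempt <= 7; attempt++)
{
Console.WriteLine($"Retry {attempt}: {config.CalculateDelay(attempt).TotalSeconds}s");
}
// Output: 1s, 2s, 4s, 8s, 16s, 30s, 30s (capped at MaxDelaySeconds)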
Why Jitter?
Without Jitter:
- All failed requests retry at the same time
- Thundering herd problem
- Overloads the recovering service
With Jitter:
- Retries spread over time
- Smooth load distribution
- Better recovery success rate
Transient vs Permanent Failures
Transient (Retryable):
- TimeoutException
- HttpRequestException
- IOException
- Connection errors
Permanent (Non-Retryable):
- InvalidOperationException (validation)
- ArgumentException
- UnauthorizedException
- HTTP 4xx errors (except 408, 429)
HTTP Status Codes Strategy
Retry:
- 408 Request Timeout
- 429 Too Many Requests
- 500 Internal Server Error
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
Don't Retry:
- 400 Bad Request
- 401 Unauthorized
- 403 Forbidden
- 404 Not Found
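As an alternative to the hand-written status-code predicate in RetryPolicyFactory, the Polly.Extensions.Http package referenced above provides HandleTransientHttpError(), which already covers HttpRequestException, 5xx, and 408; 429 can be added on top. A minimal sketch (the fixed delays are illustrative only):
using Polly;
using Polly.Extensions.Http;
var policy = HttpPolicyExtensions
.HandleTransientHttpError() // HttpRequestException, 5xx, 408
.OrResult(r => r.StatusCode == System.Net.HttpStatusCode.TooManyRequests) // 429
.WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));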
Configuration per Environment
Development:
{
"MaxRetryAttempts": 2,
"InitialDelaySeconds": 0.5,
"MaxDelaySeconds": 10.0
}
- Faster feedback during development
- Shorter delays for debugging
Production:
{
"MaxRetryAttempts": 3,
"InitialDelaySeconds": 1.0,
"MaxDelaySeconds": 30.0
}
- More resilient to transient failures
- Longer delays acceptable
Performance Considerations
Success Case:
- Negligible overhead (the policy check is fast)
- <1ms added latency
Failure Case:
- Retry delays add latency
- Example: 3 retries = up to 7s additional time (1s + 2s + 4s)
- Acceptable for transient failures
- Better than complete failure
Monitoring:
- Track retry attempts via logs
- Alert on high retry rates
- Indicates infrastructure issues
Integration with Circuit Breaker
Retry policies work best with circuit breakers:
Retry (inner) → Circuit Breaker (outer)
- Retry handles transient failures
- Circuit breaker prevents cascading failures
- Will be implemented in next issue
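A minimal sketch of the intended composition, using the same v7-style Polly API as the rest of this issue; retryPolicy stands for one of the factory-created policies above, SomeTransientOperationAsync is a placeholder, and the thresholds are illustrative and to be decided in the follow-up issue:
// Circuit breaker: open after 5 consecutive handled failures, stay open for 30 seconds
var circuitBreaker = Policy
.Handle<TimeoutException>()
.Or<IOException>()
.CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));
// Matching the ordering above, the circuit breaker is the outer policy: once it opens,
// calls fail fast without running the inner retries.
var resilient = Policy.WrapAsync(circuitBreaker, retryPolicy);
await resilient.ExecuteAsync(() => SomeTransientOperationAsync());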
Testing Strategy
Unit Tests:
- Test policy configuration
- Test retry logic
- Mock failures
- Verify retry counts
Integration Tests:
- Stop/start infrastructure
- Simulate network issues
- Verify actual retry behavior
- Measure latency impact
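A sketch of an in-process integration-style test that exercises the HTTP policy against a stub handler instead of real infrastructure. FlakyHandler and the URL are hypothetical test fixtures, not part of the codebase:
// Returns 503 for the first two calls, then 200, so the retry policy should succeed on attempt 3.
private sealed class FlakyHandler : HttpMessageHandler
{
private int _calls;
protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
{
_calls++;
var status = _calls < 3 ? System.Net.HttpStatusCode.ServiceUnavailable : System.Net.HttpStatusCode.OK;
return Task.FromResult(new HttpResponseMessage(status));
}
}
[Fact]
public async Task HttpRetryPolicy_Should_RecoverFromTransient503s()
{
var policy = RetryPolicyFactory.CreateHttpRetryPolicy(
new RetryPolicyConfiguration { MaxRetryAttempts = 3, InitialDelaySeconds = 0.05, UseJitter = false },
Microsoft.Extensions.Logging.Abstractions.NullLogger.Instance);
using var client = new HttpClient(new FlakyHandler());
var response = await policy.ExecuteAsync(() => client.GetAsync("http://localhost/health"));
response.StatusCode.Should().Be(System.Net.HttpStatusCode.OK);
}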