Phase 8.1: Implement Polly Retry Policies for External Services #107

@artcava

Description

📋 Task Description

Implement comprehensive retry policies using Polly to handle transient failures in external service calls (HTTP clients, database operations, message broker). Configure exponential backoff, jitter, and appropriate retry limits for different failure scenarios.

🎯 Objectives

  • Install and configure Polly library
  • Implement retry policies for HTTP clients
  • Implement retry policies for database operations
  • Implement retry policies for message broker
  • Configure exponential backoff with jitter
  • Define retry limits per service type
  • Integrate policies with DI container
  • Add comprehensive logging for retry attempts
  • Write unit tests for retry behavior
  • Write integration tests with simulated failures
  • Document retry policy configuration
  • Measure and monitor retry metrics

📦 Deliverables

1. Install Polly Package

Update src/StarGate.Infrastructure/StarGate.Infrastructure.csproj:

<ItemGroup>
  <PackageReference Include="Polly" Version="8.3.1" />
  <PackageReference Include="Polly.Extensions.Http" Version="3.0.0" />
  <PackageReference Include="Microsoft.Extensions.Http.Polly" Version="8.0.0" />
</ItemGroup>

2. Create Retry Policy Configuration

Create src/StarGate.Infrastructure/Resilience/RetryPolicyConfiguration.cs:

namespace StarGate.Infrastructure.Resilience;

/// <summary>
/// Configuration for retry policies.
/// </summary>
public class RetryPolicyConfiguration
{
    /// <summary>
    /// Maximum number of retry attempts.
    /// </summary>
    public int MaxRetryAttempts { get; set; } = 3;

    /// <summary>
    /// Initial delay before first retry (seconds).
    /// </summary>
    public double InitialDelaySeconds { get; set; } = 1.0;

    /// <summary>
    /// Maximum delay between retries (seconds).
    /// </summary>
    public double MaxDelaySeconds { get; set; } = 30.0;

    /// <summary>
    /// Exponential backoff multiplier.
    /// </summary>
    public double BackoffMultiplier { get; set; } = 2.0;

    /// <summary>
    /// Whether to use jitter to prevent thundering herd.
    /// </summary>
    public bool UseJitter { get; set; } = true;

    /// <summary>
    /// Calculates delay for a specific retry attempt.
    /// </summary>
    public TimeSpan CalculateDelay(int retryAttempt)
    {
        var exponentialDelay = InitialDelaySeconds * Math.Pow(BackoffMultiplier, retryAttempt - 1);
        var delay = Math.Min(exponentialDelay, MaxDelaySeconds);

        if (UseJitter)
        {
            // Random.Shared is thread-safe and avoids identical seeds when
            // many delays are computed concurrently (available since .NET 6).
            var jitter = delay * 0.2 * (Random.Shared.NextDouble() - 0.5); // +/- 10%
            delay += jitter;
        }

        return TimeSpan.FromSeconds(Math.Max(delay, 0));
    }
}

3. Create Retry Policy Factory

Create src/StarGate.Infrastructure/Resilience/RetryPolicyFactory.cs:

using Microsoft.Extensions.Logging;
using Polly;
using Polly.Retry;

namespace StarGate.Infrastructure.Resilience;

/// <summary>
/// Factory for creating Polly retry policies.
/// </summary>
public static class RetryPolicyFactory
{
    /// <summary>
    /// Creates a retry policy for HTTP operations.
    /// </summary>
    public static AsyncRetryPolicy<HttpResponseMessage> CreateHttpRetryPolicy(
        RetryPolicyConfiguration config,
        ILogger logger)
    {
        return Policy
            // Retry only transient HTTP failures: 408, 429, and 5xx.
            // Other 4xx responses are permanent and must not be retried.
            .HandleResult<HttpResponseMessage>(r =>
                (int)r.StatusCode == 408 ||
                (int)r.StatusCode == 429 ||
                (int)r.StatusCode >= 500)
            .Or<HttpRequestException>()
            .Or<TimeoutException>()
            .WaitAndRetryAsync(
                retryCount: config.MaxRetryAttempts,
                sleepDurationProvider: retryAttempt => config.CalculateDelay(retryAttempt),
                onRetry: (outcome, timespan, retryAttempt, context) =>
                {
                    var statusCode = outcome.Result?.StatusCode.ToString() ?? "N/A";
                    var exception = outcome.Exception?.GetType().Name ?? "None";

                    logger.LogWarning(
                        "HTTP retry attempt {RetryAttempt}/{MaxRetries}: StatusCode={StatusCode}, Exception={Exception}, Delay={Delay}ms",
                        retryAttempt,
                        config.MaxRetryAttempts,
                        statusCode,
                        exception,
                        timespan.TotalMilliseconds);
                });
    }

    /// <summary>
    /// Creates a retry policy for database operations.
    /// </summary>
    public static AsyncRetryPolicy CreateDatabaseRetryPolicy(
        RetryPolicyConfiguration config,
        ILogger logger)
    {
        return Policy
            .Handle<TimeoutException>()
            .Or<IOException>()
            .Or<InvalidOperationException>(ex => ex.Message.Contains("connection"))
            .WaitAndRetryAsync(
                retryCount: config.MaxRetryAttempts,
                sleepDurationProvider: retryAttempt => config.CalculateDelay(retryAttempt),
                onRetry: (exception, timespan, retryAttempt, context) =>
                {
                    logger.LogWarning(
                        exception,
                        "Database retry attempt {RetryAttempt}/{MaxRetries}: Exception={Exception}, Delay={Delay}ms",
                        retryAttempt,
                        config.MaxRetryAttempts,
                        exception.GetType().Name,
                        timespan.TotalMilliseconds);
                });
    }

    /// <summary>
    /// Creates a retry policy for message broker operations.
    /// </summary>
    public static AsyncRetryPolicy CreateBrokerRetryPolicy(
        RetryPolicyConfiguration config,
        ILogger logger)
    {
        return Policy
            .Handle<IOException>()
            .Or<TimeoutException>()
            .Or<InvalidOperationException>(ex => ex.Message.Contains("connection"))
            .WaitAndRetryAsync(
                retryCount: config.MaxRetryAttempts,
                sleepDurationProvider: retryAttempt => config.CalculateDelay(retryAttempt),
                onRetry: (exception, timespan, retryAttempt, context) =>
                {
                    logger.LogWarning(
                        exception,
                        "Broker retry attempt {RetryAttempt}/{MaxRetries}: Exception={Exception}, Delay={Delay}ms",
                        retryAttempt,
                        config.MaxRetryAttempts,
                        exception.GetType().Name,
                        timespan.TotalMilliseconds);
                });
    }

    /// <summary>
    /// Creates a generic retry policy for any async operation.
    /// </summary>
    public static AsyncRetryPolicy CreateGenericRetryPolicy(
        RetryPolicyConfiguration config,
        ILogger logger)
    {
        return Policy
            .Handle<Exception>(ex => IsTransientException(ex))
            .WaitAndRetryAsync(
                retryCount: config.MaxRetryAttempts,
                sleepDurationProvider: retryAttempt => config.CalculateDelay(retryAttempt),
                onRetry: (exception, timespan, retryAttempt, context) =>
                {
                    logger.LogWarning(
                        exception,
                        "Generic retry attempt {RetryAttempt}/{MaxRetries}: Exception={Exception}, Delay={Delay}ms",
                        retryAttempt,
                        config.MaxRetryAttempts,
                        exception.GetType().Name,
                        timespan.TotalMilliseconds);
                });
    }

    private static bool IsTransientException(Exception ex)
    {
        return ex is TimeoutException
            || ex is HttpRequestException
            || ex is IOException
            || (ex is InvalidOperationException && ex.Message.Contains("connection"));
    }
}

4. Configure Policies in DI Container

Create src/StarGate.Infrastructure/Extensions/ResilienceServiceCollectionExtensions.cs:

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Polly;
using StarGate.Infrastructure.Resilience;

namespace StarGate.Infrastructure.Extensions;

/// <summary>
/// Extension methods for registering resilience policies.
/// </summary>
public static class ResilienceServiceCollectionExtensions
{
    /// <summary>
    /// Adds resilience policies to the service collection.
    /// </summary>
    public static IServiceCollection AddResiliencePolicies(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        // Register retry policy configuration
        services.Configure<RetryPolicyConfiguration>(
            configuration.GetSection("Resilience:Retry"));

        // Register retry policies as singletons
        services.AddSingleton(provider =>
        {
            var config = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
            var logger = provider.GetRequiredService<ILogger<RetryPolicyFactory>>();
            return RetryPolicyFactory.CreateDatabaseRetryPolicy(config, logger);
        });

        services.AddSingleton(provider =>
        {
            var config = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
            var logger = provider.GetRequiredService<ILogger<RetryPolicyFactory>>();
            return RetryPolicyFactory.CreateBrokerRetryPolicy(config, logger);
        });

        return services;
    }

    /// <summary>
    /// Adds HTTP client with retry policy.
    /// </summary>
    public static IHttpClientBuilder AddHttpClientWithRetry<TClient>(
        this IServiceCollection services,
        string name)
        where TClient : class
    {
        return services
            .AddHttpClient<TClient>(name)
            .AddPolicyHandler((provider, request) =>
            {
                var config = provider.GetRequiredService<IOptions<RetryPolicyConfiguration>>().Value;
                var logger = provider.GetRequiredService<ILogger<TClient>>();
                return RetryPolicyFactory.CreateHttpRetryPolicy(config, logger);
            });
    }
}

5. Update Configuration

Update src/StarGate.Server/appsettings.json:

{
  "Resilience": {
    "Retry": {
      "MaxRetryAttempts": 3,
      "InitialDelaySeconds": 1.0,
      "MaxDelaySeconds": 30.0,
      "BackoffMultiplier": 2.0,
      "UseJitter": true
    }
  }
}

Update src/StarGate.Server/appsettings.Development.json:

{
  "Resilience": {
    "Retry": {
      "MaxRetryAttempts": 2,
      "InitialDelaySeconds": 0.5,
      "MaxDelaySeconds": 10.0,
      "BackoffMultiplier": 2.0,
      "UseJitter": true
    }
  }
}

6. Apply Retry Policies to Repositories

Update src/StarGate.Infrastructure/Repositories/MongoProcessRepository.cs:

private readonly AsyncRetryPolicy _retryPolicy;

public MongoProcessRepository(
    IMongoDatabase database,
    AsyncRetryPolicy retryPolicy,
    ILogger<MongoProcessRepository> logger)
{
    _database = database ?? throw new ArgumentNullException(nameof(database));
    _retryPolicy = retryPolicy ?? throw new ArgumentNullException(nameof(retryPolicy));
    _logger = logger ?? throw new ArgumentNullException(nameof(logger));
    _collection = _database.GetCollection<ProcessDocument>("processes");
}

public async Task CreateAsync(Process process, CancellationToken cancellationToken = default)
{
    // Pass the token into the policy so both the operation and the retry
    // delays are cancellable.
    await _retryPolicy.ExecuteAsync(async ct =>
    {
        var document = MapToDocument(process);
        await _collection.InsertOneAsync(document, cancellationToken: ct);
        _logger.LogDebug("Process created: ProcessId={ProcessId}", process.ProcessId);
    }, cancellationToken);
}

public async Task<Process?> GetByIdAsync(Guid processId, CancellationToken cancellationToken = default)
{
    return await _retryPolicy.ExecuteAsync(async ct =>
    {
        var filter = Builders<ProcessDocument>.Filter.Eq(d => d.Id, processId.ToString());
        var document = await _collection.Find(filter).FirstOrDefaultAsync(ct);
        return document != null ? MapToDomain(document) : null;
    }, cancellationToken);
}

7. Apply Retry Policies to Message Broker

Update src/StarGate.Infrastructure/Messaging/RabbitMqBroker.cs:

private readonly AsyncRetryPolicy _retryPolicy;

public RabbitMqBroker(
    IConnection connection,
    AsyncRetryPolicy retryPolicy,
    ILogger<RabbitMqBroker> logger)
{
    _connection = connection ?? throw new ArgumentNullException(nameof(connection));
    _retryPolicy = retryPolicy ?? throw new ArgumentNullException(nameof(retryPolicy));
    _logger = logger ?? throw new ArgumentNullException(nameof(logger));
}

public async Task PublishAsync<T>(
    T message,
    string routingKey,
    CancellationToken cancellationToken = default) where T : class
{
    // BasicPublish is synchronous, so the delegate simply returns a completed
    // task; passing the token still lets Polly cancel pending retry delays.
    await _retryPolicy.ExecuteAsync(ct =>
    {
        ct.ThrowIfCancellationRequested();

        using var channel = _connection.CreateModel();

        var messageBody = SerializeMessage(message);
        var properties = channel.CreateBasicProperties();
        properties.Persistent = true;
        properties.ContentType = "application/json";
        properties.MessageId = Guid.NewGuid().ToString();

        channel.BasicPublish(
            exchange: "stargate.processes",
            routingKey: routingKey,
            basicProperties: properties,
            body: messageBody);

        _logger.LogDebug("Message published: RoutingKey={RoutingKey}, MessageId={MessageId}",
            routingKey, properties.MessageId);

        return Task.CompletedTask;
    }, cancellationToken);
}

8. Register Policies in Program.cs

Update src/StarGate.Server/Program.cs:

// Add resilience policies
builder.Services.AddResiliencePolicies(builder.Configuration);
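
Typed HTTP clients can opt into the retry handler via the `AddHttpClientWithRetry` extension from deliverable 4. `ExternalApiClient` below is a hypothetical typed client used only for illustration, not an existing class in the solution:

```csharp
// Wraps the named typed client's handler pipeline with the HTTP retry policy.
builder.Services.AddHttpClientWithRetry<ExternalApiClient>("external-api");
```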

9. Create Unit Tests

Create tests/StarGate.Infrastructure.Tests/Resilience/RetryPolicyConfigurationTests.cs:

using FluentAssertions;
using StarGate.Infrastructure.Resilience;
using Xunit;

namespace StarGate.Infrastructure.Tests.Resilience;

public class RetryPolicyConfigurationTests
{
    [Theory]
    [InlineData(1, 1.0)]   // First retry: 1 second
    [InlineData(2, 2.0)]   // Second retry: 2 seconds
    [InlineData(3, 4.0)]   // Third retry: 4 seconds
    [InlineData(4, 8.0)]   // Fourth retry: 8 seconds
    public void CalculateDelay_Should_UseExponentialBackoff(int retryAttempt, double expectedSeconds)
    {
        // Arrange
        var config = new RetryPolicyConfiguration
        {
            InitialDelaySeconds = 1.0,
            BackoffMultiplier = 2.0,
            MaxDelaySeconds = 30.0,
            UseJitter = false
        };

        // Act
        var delay = config.CalculateDelay(retryAttempt);

        // Assert
        delay.TotalSeconds.Should().Be(expectedSeconds);
    }

    [Fact]
    public void CalculateDelay_Should_RespectMaxDelay()
    {
        // Arrange
        var config = new RetryPolicyConfiguration
        {
            InitialDelaySeconds = 1.0,
            BackoffMultiplier = 2.0,
            MaxDelaySeconds = 5.0,
            UseJitter = false
        };

        // Act
        var delay = config.CalculateDelay(10); // Would be 512 seconds without cap

        // Assert
        delay.TotalSeconds.Should().Be(5.0);
    }

    [Fact]
    public void CalculateDelay_Should_AddJitter_WhenEnabled()
    {
        // Arrange
        var config = new RetryPolicyConfiguration
        {
            InitialDelaySeconds = 10.0,
            BackoffMultiplier = 2.0,
            UseJitter = true
        };

        // Act
        var delays = Enumerable.Range(0, 20)
            .Select(_ => config.CalculateDelay(1).TotalSeconds)
            .ToList();

        // Assert - delays should vary due to jitter
        delays.Should().OnlyHaveUniqueItems();
        delays.Should().AllSatisfy(d => d.Should().BeInRange(9.0, 11.0)); // 10 +/- 10%
    }

    [Fact]
    public void CalculateDelay_Should_NotReturnNegativeDelay()
    {
        // Arrange
        var config = new RetryPolicyConfiguration
        {
            InitialDelaySeconds = 0.1,
            UseJitter = true
        };

        // Act
        var delays = Enumerable.Range(0, 100)
            .Select(_ => config.CalculateDelay(1))
            .ToList();

        // Assert
        delays.Should().AllSatisfy(d => d.Should().BeGreaterOrEqualTo(TimeSpan.Zero));
    }
}

Create tests/StarGate.Infrastructure.Tests/Resilience/RetryPolicyFactoryTests.cs:

using FluentAssertions;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using Polly;
using StarGate.Infrastructure.Resilience;
using Xunit;

namespace StarGate.Infrastructure.Tests.Resilience;

public class RetryPolicyFactoryTests
{
    private readonly RetryPolicyConfiguration _config;
    // RetryPolicyFactory is a static class and so cannot be a type argument
    // for NullLogger<T>; the non-generic NullLogger.Instance works instead.
    private readonly ILogger _logger;

    public RetryPolicyFactoryTests()
    {
        _config = new RetryPolicyConfiguration
        {
            MaxRetryAttempts = 3,
            InitialDelaySeconds = 0.1,
            UseJitter = false
        };
        _logger = NullLogger.Instance;
    }

    [Fact]
    public async Task HttpRetryPolicy_Should_RetryOnHttpRequestException()
    {
        // Arrange
        var policy = RetryPolicyFactory.CreateHttpRetryPolicy(_config, _logger);
        var attemptCount = 0;

        // Act
        var act = async () => await policy.ExecuteAsync(async () =>
        {
            attemptCount++;
            await Task.CompletedTask;
            throw new HttpRequestException("Simulated failure");
        });

        // Assert
        await act.Should().ThrowAsync<HttpRequestException>();
        attemptCount.Should().Be(4); // Initial + 3 retries
    }

    [Fact]
    public async Task DatabaseRetryPolicy_Should_RetryOnTimeoutException()
    {
        // Arrange
        var policy = RetryPolicyFactory.CreateDatabaseRetryPolicy(_config, _logger);
        var attemptCount = 0;

        // Act
        var act = async () => await policy.ExecuteAsync(async () =>
        {
            attemptCount++;
            await Task.CompletedTask;
            throw new TimeoutException("Simulated timeout");
        });

        // Assert
        await act.Should().ThrowAsync<TimeoutException>();
        attemptCount.Should().Be(4); // Initial + 3 retries
    }

    [Fact]
    public async Task BrokerRetryPolicy_Should_RetryOnIOException()
    {
        // Arrange
        var policy = RetryPolicyFactory.CreateBrokerRetryPolicy(_config, _logger);
        var attemptCount = 0;

        // Act
        var act = async () => await policy.ExecuteAsync(async () =>
        {
            attemptCount++;
            await Task.CompletedTask;
            throw new IOException("Simulated IO error");
        });

        // Assert
        await act.Should().ThrowAsync<IOException>();
        attemptCount.Should().Be(4); // Initial + 3 retries
    }

    [Fact]
    public async Task RetryPolicy_Should_SucceedOnEventualSuccess()
    {
        // Arrange
        var policy = RetryPolicyFactory.CreateGenericRetryPolicy(_config, _logger);
        var attemptCount = 0;

        // Act
        await policy.ExecuteAsync(async () =>
        {
            attemptCount++;
            if (attemptCount < 3)
            {
                throw new TimeoutException("Transient failure");
            }
            await Task.CompletedTask;
        });

        // Assert
        attemptCount.Should().Be(3); // 2 failures + 1 success
    }
}

✅ Acceptance Criteria

  • Polly packages installed and configured
  • RetryPolicyConfiguration implemented with exponential backoff and jitter
  • RetryPolicyFactory created with policies for HTTP, database, and broker
  • Generic retry policy for transient exceptions
  • Retry policies registered in DI container
  • Retry policies applied to MongoProcessRepository
  • Retry policies applied to RabbitMqBroker
  • Configuration files updated with retry settings
  • Different settings for Development and Production
  • Comprehensive logging for retry attempts
  • Unit tests for RetryPolicyConfiguration
  • Unit tests for RetryPolicyFactory
  • Integration tests with simulated failures
  • Code follows CODING-CONVENTIONS.md

📝 Testing Instructions

# Run unit tests
dotnet test tests/StarGate.Infrastructure.Tests --filter "FullyQualifiedName~Retry"

# Test retry with MongoDB
# 1. Start MongoDB
docker-compose up -d mongodb

# 2. Stop MongoDB to simulate failure
docker-compose stop mongodb

# 3. Try to create process (should retry and eventually fail)
POST /api/processes

# 4. Check logs for retry attempts (delays are approximate when jitter is enabled):
# "Database retry attempt 1/3: Exception=TimeoutException, Delay=1000ms"
# "Database retry attempt 2/3: Exception=TimeoutException, Delay=2000ms"
# "Database retry attempt 3/3: Exception=TimeoutException, Delay=4000ms"

# 5. Restart MongoDB
docker-compose start mongodb

# 6. Verify requests succeed

# Test retry with RabbitMQ
# 1. Stop RabbitMQ during process creation
docker-compose stop rabbitmq

# 2. Create process (should retry)
POST /api/processes

# 3. Check logs for broker retry attempts

# Test exponential backoff
# Verify in logs that delays increase: 1s -> 2s -> 4s

# Test jitter
# Create multiple processes simultaneously
# Verify retry delays vary slightly (not all exactly 1s, 2s, 4s)

# Performance test
# Measure overhead with retry policies enabled
# Should add <10ms per operation in success case

📚 References

🏷️ Labels

phase-8 resilience sprint-8.1 polly retry-policies

⏱️ Estimated Effort

8-10 hours

🔗 Dependencies

🔗 Related Issues

Part of Phase 8: Resilience - Sprint 8.1: Polly Integration

📌 Important Notes

Exponential Backoff Formula

Delay = InitialDelay * (Multiplier ^ (RetryAttempt - 1))
Delay = min(Delay, MaxDelay)

With Jitter:
Jitter = Delay * 0.2 * (Random - 0.5)
FinalDelay = Delay + Jitter

Example (Initial=1s, Multiplier=2):
Retry 1: 1s  (+/- 10% jitter)
Retry 2: 2s  (+/- 10% jitter)
Retry 3: 4s  (+/- 10% jitter)
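
The table above can be reproduced directly with `RetryPolicyConfiguration.CalculateDelay`; a quick console check (jitter disabled so the values are exact):

```csharp
var config = new RetryPolicyConfiguration
{
    InitialDelaySeconds = 1.0,
    BackoffMultiplier = 2.0,
    MaxDelaySeconds = 30.0,
    UseJitter = false
};

for (var attempt = 1; attempt <= 3; attempt++)
{
    // Prints 1s, 2s, 4s — matching the worked example above.
    Console.WriteLine($"Retry {attempt}: {config.CalculateDelay(attempt).TotalSeconds}s");
}
```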

Why Jitter?

Without Jitter:

  • All failed requests retry at same time
  • Thundering herd problem
  • Overloads recovering service

With Jitter:

  • Retries spread over time
  • Smooth load distribution
  • Better recovery success rate

Transient vs Permanent Failures

Transient (Retryable):

  • TimeoutException
  • HttpRequestException
  • IOException
  • Connection errors

Permanent (Non-Retryable):

  • InvalidOperationException (validation)
  • ArgumentException
  • UnauthorizedException
  • HTTP 4xx errors (except 408, 429)

HTTP Status Codes Strategy

Retry:

  • 408 Request Timeout
  • 429 Too Many Requests
  • 500 Internal Server Error
  • 502 Bad Gateway
  • 503 Service Unavailable
  • 504 Gateway Timeout

Don't Retry:

  • 400 Bad Request
  • 401 Unauthorized
  • 403 Forbidden
  • 404 Not Found
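
The `Polly.Extensions.Http` package referenced earlier ships a helper that encodes most of this table: `HandleTransientHttpError` covers `HttpRequestException`, 5xx, and 408, while 429 has to be appended by hand. A sketch of an equivalent standalone policy built on it (the retry count and delays here are illustrative):

```csharp
using System.Net;
using Polly;
using Polly.Extensions.Http;

// HandleTransientHttpError: HttpRequestException, any 5xx, and 408.
// 429 Too Many Requests is added explicitly to match the strategy above.
var httpRetry = HttpPolicyExtensions
    .HandleTransientHttpError()
    .OrResult(r => r.StatusCode == HttpStatusCode.TooManyRequests)
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
```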

Configuration per Environment

Development:

{
  "MaxRetryAttempts": 2,
  "InitialDelaySeconds": 0.5,
  "MaxDelaySeconds": 10.0
}
  • Faster feedback during development
  • Shorter delays for debugging

Production:

{
  "MaxRetryAttempts": 3,
  "InitialDelaySeconds": 1.0,
  "MaxDelaySeconds": 30.0
}
  • More resilient to transient failures
  • Longer delays acceptable

Performance Considerations

Success Case:

  • No overhead (policy check is fast)
  • <1ms added latency

Failure Case:

  • Retry delays add latency
  • Example: 3 retries = up to 7s additional time (1s + 2s + 4s)
  • Acceptable for transient failures
  • Better than complete failure

Monitoring:

  • Track retry attempts via logs
  • Alert on high retry rates
  • Indicates infrastructure issues
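
One lightweight way to surface retry rates alongside the logs is a counter from `System.Diagnostics.Metrics`, incremented in the `onRetry` callbacks. This is an illustrative sketch; the meter name, counter name, and tag are placeholders, not names that exist in the codebase:

```csharp
using System.Diagnostics.Metrics;

public static class ResilienceMetrics
{
    private static readonly Meter Meter = new("StarGate.Resilience");

    // Incremented from onRetry callbacks; export via OpenTelemetry and
    // alert when the rate stays elevated.
    public static readonly Counter<long> RetryAttempts =
        Meter.CreateCounter<long>("stargate.retry_attempts");
}

// Example call from inside an onRetry callback:
// ResilienceMetrics.RetryAttempts.Add(1,
//     new KeyValuePair<string, object?>("target", "database"));
```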

Integration with Circuit Breaker

Retry policies work best with circuit breakers:

Retry (inner) → Circuit Breaker (outer)
  • Retry handles transient failures
  • Circuit breaker prevents cascading failures
  • Will be implemented in next issue
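
In Polly's v7-style API the composition sketched above is a `PolicyWrap`; the first argument to `Policy.WrapAsync` is the outermost policy. A minimal sketch assuming the same transient-exception set (the thresholds here are illustrative, not final configuration):

```csharp
using Polly;

var retry = Policy
    .Handle<TimeoutException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));

// Break after 5 consecutive failures; stay open for 30 seconds.
var circuitBreaker = Policy
    .Handle<TimeoutException>()
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

// Circuit breaker outermost, as in the diagram above: once the circuit is
// open, calls fail fast before any retry is attempted.
var resilient = Policy.WrapAsync(circuitBreaker, retry);
```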

Testing Strategy

Unit Tests:

  • Test policy configuration
  • Test retry logic
  • Mock failures
  • Verify retry counts

Integration Tests:

  • Stop/start infrastructure
  • Simulate network issues
  • Verify actual retry behavior
  • Measure latency impact
