Phase 4+: Implement Circuit Breakers for External Dependencies #118

@artcava

📋 Task Description

Implement circuit breakers using Polly to protect against cascading failures when external dependencies (MongoDB, Redis, RabbitMQ) become unavailable. Configure circuit breakers with appropriate thresholds, break durations, and fallback behaviors.

🎯 Objectives

  • Install Polly NuGet packages
  • Implement circuit breaker for MongoDB operations
  • Implement circuit breaker for Redis operations
  • Implement circuit breaker for RabbitMQ operations
  • Configure circuit breaker thresholds (failure %, count)
  • Configure break duration and half-open retries
  • Add circuit breaker state monitoring
  • Implement fallback behaviors
  • Add circuit breaker metrics for Prometheus
  • Create manual circuit breaker control endpoints
  • Write unit tests for circuit breaker behavior
  • Document circuit breaker configuration

📦 Deliverables

1. Install Polly Packages

Update src/StarGate.Infrastructure/StarGate.Infrastructure.csproj:

<ItemGroup>
  <PackageReference Include="Polly" Version="8.3.1" />
  <PackageReference Include="Polly.Extensions.Http" Version="3.0.0" />
  <PackageReference Include="Polly.Contrib.WaitAndRetry" Version="1.1.1" />
</ItemGroup>

2. Create Circuit Breaker Configuration

Create src/StarGate.Infrastructure/Resilience/CircuitBreakerOptions.cs:

namespace StarGate.Infrastructure.Resilience;

public class CircuitBreakerOptions
{
    public const string SectionName = "Resilience:CircuitBreaker";

    /// <summary>
    /// Fraction of calls in the sampling window that must fail before breaking (greater than 0.0, up to 1.0)
    /// </summary>
    public double FailureThreshold { get; set; } = 0.5;

    /// <summary>
    /// Minimum number of calls in the sampling window before the circuit can break
    /// </summary>
    public int MinimumThroughput { get; set; } = 10;

    /// <summary>
    /// Duration to keep circuit open (seconds)
    /// </summary>
    public int BreakDurationSeconds { get; set; } = 30;

    /// <summary>
    /// Sampling duration for failure rate calculation (seconds)
    /// </summary>
    public int SamplingDurationSeconds { get; set; } = 60;

    /// <summary>
    /// Enable circuit breaker
    /// </summary>
    public bool Enabled { get; set; } = true;
}
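
The options above are passed straight to Polly, so an out-of-range value would only surface when the first policy is created. The sketch below checks the values up front. It assumes Polly's `AdvancedCircuitBreakerAsync` rejects a threshold outside (0, 1], a minimum throughput below 2, and a negative break duration, which is worth confirming against the installed version. `CircuitBreakerOptionsValidator` is a hypothetical helper, not part of the issue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: checks CircuitBreakerOptions-style values before they reach Polly.
public static class CircuitBreakerOptionsValidator
{
    // Returns human-readable problems; an empty list means the values look usable.
    public static IReadOnlyList<string> Validate(
        double failureThreshold,
        int minimumThroughput,
        int breakDurationSeconds,
        int samplingDurationSeconds)
    {
        var errors = new List<string>();

        if (double.IsNaN(failureThreshold) || failureThreshold <= 0.0 || failureThreshold > 1.0)
            errors.Add("FailureThreshold must be greater than 0 and at most 1.");

        // Assumption: Polly requires more than one call before it will judge the failure rate.
        if (minimumThroughput < 2)
            errors.Add("MinimumThroughput must be at least 2.");

        if (breakDurationSeconds < 0)
            errors.Add("BreakDurationSeconds must not be negative.");

        if (samplingDurationSeconds < 1)
            errors.Add("SamplingDurationSeconds must be at least 1.");

        return errors;
    }
}
```

It could be called at startup with `options.FailureThreshold`, `options.MinimumThroughput`, `options.BreakDurationSeconds` and `options.SamplingDurationSeconds`, failing fast if the returned list is not empty.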

3. Create Circuit Breaker Service

Create src/StarGate.Infrastructure/Resilience/CircuitBreakerService.cs:

namespace StarGate.Infrastructure.Resilience;

using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Polly;
using Polly.CircuitBreaker;
using System.Collections.Concurrent;

public interface ICircuitBreakerService
{
    AsyncCircuitBreakerPolicy GetOrCreatePolicy(string resourceName);
    CircuitState GetState(string resourceName);
    void Reset(string resourceName);
    Dictionary<string, CircuitBreakerStatus> GetAllStatuses();
}

public class CircuitBreakerService : ICircuitBreakerService
{
    private readonly ConcurrentDictionary<string, AsyncCircuitBreakerPolicy> _policies;
    private readonly CircuitBreakerOptions _options;
    private readonly ILogger<CircuitBreakerService> _logger;

    public CircuitBreakerService(
        IOptions<CircuitBreakerOptions> options,
        ILogger<CircuitBreakerService> logger)
    {
        _options = options?.Value ?? throw new ArgumentNullException(nameof(options));
        _logger = logger ?? throw new ArgumentNullException(nameof(logger));
        _policies = new ConcurrentDictionary<string, AsyncCircuitBreakerPolicy>();
    }

    public AsyncCircuitBreakerPolicy GetOrCreatePolicy(string resourceName)
    {
        return _policies.GetOrAdd(resourceName, CreatePolicy);
    }

    private AsyncCircuitBreakerPolicy CreatePolicy(string resourceName)
    {
        if (!_options.Enabled)
        {
            _logger.LogInformation(
                "Circuit breaker disabled for {ResourceName}",
                resourceName);

            return Policy
                .Handle<Exception>()
                .CircuitBreakerAsync(
                    exceptionsAllowedBeforeBreaking: int.MaxValue,
                    durationOfBreak: TimeSpan.Zero);
        }

        var policy = Policy
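            .Handle<Exception>(ex =>
            {
                // Don't break for validation errors (ArgumentNullException derives
                // from ArgumentException) or for caller-initiated cancellation.
                return ex is not ArgumentException and not OperationCanceledException;
            })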
            .AdvancedCircuitBreakerAsync(
                failureThreshold: _options.FailureThreshold,
                samplingDuration: TimeSpan.FromSeconds(_options.SamplingDurationSeconds),
                minimumThroughput: _options.MinimumThroughput,
                durationOfBreak: TimeSpan.FromSeconds(_options.BreakDurationSeconds),
                onBreak: (exception, duration) =>
                {
                    _logger.LogWarning(
                        exception,
                        "Circuit breaker opened for {ResourceName}. Duration: {Duration}s",
                        resourceName,
                        duration.TotalSeconds);
                },
                onReset: () =>
                {
                    _logger.LogInformation(
                        "Circuit breaker closed for {ResourceName}",
                        resourceName);
                },
                onHalfOpen: () =>
                {
                    _logger.LogInformation(
                        "Circuit breaker half-open for {ResourceName}",
                        resourceName);
                });

        _logger.LogInformation(
            "Created circuit breaker for {ResourceName} with threshold {Threshold}% over {SamplingDuration}s",
            resourceName,
            _options.FailureThreshold * 100,
            _options.SamplingDurationSeconds);

        return policy;
    }

    public CircuitState GetState(string resourceName)
    {
        if (_policies.TryGetValue(resourceName, out var policy))
        {
            return policy.CircuitState;
        }
        return CircuitState.Closed;
    }

    public void Reset(string resourceName)
    {
        if (_policies.TryGetValue(resourceName, out var policy))
        {
            policy.Reset();
            _logger.LogInformation(
                "Circuit breaker manually reset for {ResourceName}",
                resourceName);
        }
    }

    public Dictionary<string, CircuitBreakerStatus> GetAllStatuses()
    {
        var statuses = new Dictionary<string, CircuitBreakerStatus>();

        foreach (var kvp in _policies)
        {
            statuses[kvp.Key] = new CircuitBreakerStatus
            {
                ResourceName = kvp.Key,
                State = kvp.Value.CircuitState.ToString(),
                IsOpen = kvp.Value.CircuitState == CircuitState.Open,
                IsHalfOpen = kvp.Value.CircuitState == CircuitState.HalfOpen
            };
        }

        return statuses;
    }
}

public class CircuitBreakerStatus
{
    public required string ResourceName { get; init; }
    public required string State { get; init; }
    public required bool IsOpen { get; init; }
    public required bool IsHalfOpen { get; init; }
}

4. Integrate Circuit Breakers into Repositories

Update src/StarGate.Infrastructure/Persistence/MongoDB/MongoProcessRepository.cs:

using Microsoft.Extensions.Logging;
using MongoDB.Driver;
using Polly.CircuitBreaker;
using StarGate.Infrastructure.Resilience;

public class MongoProcessRepository : IProcessRepository
{
    private readonly IMongoCollection<Process> _collection;
    private readonly ILogger<MongoProcessRepository> _logger;
    private readonly AsyncCircuitBreakerPolicy _circuitBreaker;

    public MongoProcessRepository(
        IMongoDatabase database,
        ICircuitBreakerService circuitBreakerService,
        ILogger<MongoProcessRepository> logger)
    {
        _collection = database.GetCollection<Process>("processes");
        _logger = logger;
        // One shared breaker guards every MongoDB operation in this repository.
        _circuitBreaker = circuitBreakerService.GetOrCreatePolicy("MongoDB");
    }

    public async Task<Process?> GetByIdAsync(
        Guid processId,
        CancellationToken cancellationToken = default)
    {
        try
        {
            return await _circuitBreaker.ExecuteAsync(async (ct) =>
            {
                var filter = Builders<Process>.Filter.Eq(p => p.ProcessId, processId);
                return await _collection.Find(filter).FirstOrDefaultAsync(ct);
            }, cancellationToken);
        }
        catch (BrokenCircuitException ex)
        {
            _logger.LogWarning(
                ex,
                "Circuit breaker open for MongoDB - GetByIdAsync failed fast");
            throw new ServiceUnavailableException("Database temporarily unavailable", ex);
        }
    }

    public async Task CreateAsync(
        Process process,
        CancellationToken cancellationToken = default)
    {
        try
        {
            await _circuitBreaker.ExecuteAsync(async (ct) =>
            {
                await _collection.InsertOneAsync(process, cancellationToken: ct);
            }, cancellationToken);
        }
        catch (BrokenCircuitException ex)
        {
            _logger.LogWarning(
                ex,
                "Circuit breaker open for MongoDB - CreateAsync failed fast");
            throw new ServiceUnavailableException("Database temporarily unavailable", ex);
        }
    }

    // Apply to all methods...
}
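
The repository throws `ServiceUnavailableException`, which this issue references but does not define. If the type does not already exist in the codebase, a minimal version might look like the sketch below. Its shape is assumed from the two-argument constructor call used above.

```csharp
using System;

// Assumed shape of the exception thrown when a circuit is open; the issue uses
// this type without defining it.
public class ServiceUnavailableException : Exception
{
    public ServiceUnavailableException(string message)
        : base(message)
    {
    }

    // Keeps the original BrokenCircuitException available for logging and diagnostics.
    public ServiceUnavailableException(string message, Exception innerException)
        : base(message, innerException)
    {
    }
}
```

The testing instructions expect a 503 response when the circuit is open, so the API layer would also need to translate this exception into HTTP 503. That mapping is not shown here.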

5. Create Circuit Breaker Control Endpoints

Create src/StarGate.Server/Endpoints/CircuitBreakerEndpoints.cs:

namespace StarGate.Server.Endpoints;

using Microsoft.AspNetCore.Mvc;
using StarGate.Infrastructure.Resilience;

public static class CircuitBreakerEndpoints
{
    public static void MapCircuitBreakerEndpoints(this IEndpointRouteBuilder app)
    {
        var group = app.MapGroup("/api/circuit-breakers")
            .WithTags("Circuit Breakers")
            .RequireAuthorization(); // Protect admin endpoints

        // Get all circuit breaker statuses
        group.MapGet("/", (
            [FromServices] ICircuitBreakerService service) =>
        {
            var statuses = service.GetAllStatuses();
            return Results.Ok(statuses);
        })
        .WithName("GetCircuitBreakerStatuses")
        .WithOpenApi();

        // Get specific circuit breaker status
        group.MapGet("/{resourceName}", (
            string resourceName,
            [FromServices] ICircuitBreakerService service) =>
        {
            var state = service.GetState(resourceName);
            return Results.Ok(new
            {
                resourceName,
                state = state.ToString(),
                isOpen = state == Polly.CircuitBreaker.CircuitState.Open
            });
        })
        .WithName("GetCircuitBreakerStatus")
        .WithOpenApi();

        // Reset circuit breaker
        group.MapPost("/{resourceName}/reset", (
            string resourceName,
            [FromServices] ICircuitBreakerService service) =>
        {
            service.Reset(resourceName);
            return Results.Ok(new
            {
                message = $"Circuit breaker reset for {resourceName}"
            });
        })
        .WithName("ResetCircuitBreaker")
        .WithOpenApi();
    }
}

Register in Program.cs:

app.MapCircuitBreakerEndpoints();

6. Add Circuit Breaker Metrics

Update src/StarGate.Core/Metrics/ApplicationMetrics.cs:

public static readonly Gauge CircuitBreakerState = Metrics.CreateGauge(
    "stargate_circuit_breaker_state",
    "Circuit breaker state (0=Closed, 1=Open, 2=HalfOpen)",
    new GaugeConfiguration
    {
        LabelNames = new[] { "resource_name" }
    });

public static readonly Counter CircuitBreakerOpened = Metrics.CreateCounter(
    "stargate_circuit_breaker_opened_total",
    "Total number of times circuit breaker opened",
    new CounterConfiguration
    {
        LabelNames = new[] { "resource_name" }
    });

public static readonly Counter CircuitBreakerRejected = Metrics.CreateCounter(
    "stargate_circuit_breaker_rejected_total",
    "Total number of requests rejected by circuit breaker",
    new CounterConfiguration
    {
        LabelNames = new[] { "resource_name" }
    });

Update circuit breaker to emit metrics:

onBreak: (exception, duration) =>
{
    _logger.LogWarning(...);
    ApplicationMetrics.CircuitBreakerOpened
        .WithLabels(resourceName)
        .Inc();
    ApplicationMetrics.CircuitBreakerState
        .WithLabels(resourceName)
        .Set(1); // Open
},
onReset: () =>
{
    _logger.LogInformation(...);
    ApplicationMetrics.CircuitBreakerState
        .WithLabels(resourceName)
        .Set(0); // Closed
},
onHalfOpen: () =>
{
    _logger.LogInformation(...);
    ApplicationMetrics.CircuitBreakerState
        .WithLabels(resourceName)
        .Set(2); // HalfOpen
}
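
The callbacks above update the state gauge and the opened counter, but nothing increments `stargate_circuit_breaker_rejected_total`. Rejections surface as `BrokenCircuitException` at the call site, so one option is a small wrapper around `ExecuteAsync`. The sketch below assumes the v7-style Polly API used elsewhere in this issue. `ExecuteCountingRejectionsAsync` is a hypothetical name, and the `onRejected` callback keeps the helper independent of the metrics library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly.CircuitBreaker;

// Hypothetical extension method (the name is an assumption, not part of Polly).
public static class CircuitBreakerRejectionExtensions
{
    // Runs the action through the breaker and invokes onRejected whenever the call
    // is short-circuited, then rethrows so callers keep their own error handling.
    public static async Task<T> ExecuteCountingRejectionsAsync<T>(
        this AsyncCircuitBreakerPolicy policy,
        Func<CancellationToken, Task<T>> action,
        Action onRejected,
        CancellationToken cancellationToken = default)
    {
        try
        {
            return await policy.ExecuteAsync(action, cancellationToken);
        }
        catch (BrokenCircuitException)
        {
            // Thrown by Polly when the circuit is open or has been manually isolated.
            onRejected();
            throw;
        }
    }
}
```

A repository method could pass `() => ApplicationMetrics.CircuitBreakerRejected.WithLabels("MongoDB").Inc()` as `onRejected` and keep its existing `catch (BrokenCircuitException)` block for the translation to `ServiceUnavailableException`.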

7. Add Configuration

Update src/StarGate.Server/appsettings.json:

{
  "Resilience": {
    "CircuitBreaker": {
      "Enabled": true,
      "FailureThreshold": 0.5,
      "MinimumThroughput": 10,
      "BreakDurationSeconds": 30,
      "SamplingDurationSeconds": 60
    }
  }
}
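
The issue does not show how this section is bound to `CircuitBreakerOptions` or how `ICircuitBreakerService` is registered. A minimal sketch, assuming the standard `Microsoft.Extensions.DependencyInjection` and options APIs, follows. The extension method name `AddCircuitBreakers` is an assumption.

```csharp
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using StarGate.Infrastructure.Resilience;

// Hypothetical extension method; the name AddCircuitBreakers is an assumption.
public static class ResilienceServiceCollectionExtensions
{
    public static IServiceCollection AddCircuitBreakers(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        // Bind the "Resilience:CircuitBreaker" section onto CircuitBreakerOptions.
        services.Configure<CircuitBreakerOptions>(
            configuration.GetSection(CircuitBreakerOptions.SectionName));

        // Singleton so that every consumer shares one breaker, and one state, per resource.
        services.AddSingleton<ICircuitBreakerService, CircuitBreakerService>();

        return services;
    }
}
```

In Program.cs this would be called as `builder.Services.AddCircuitBreakers(builder.Configuration);`. A scoped or transient registration would give each consumer its own policy dictionary, so failures would never accumulate in a shared breaker.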

8. Create Documentation

Create docs/CIRCUIT-BREAKERS.md:

# Circuit Breakers - StarGate

## Overview

Circuit breakers protect against cascading failures by failing fast when dependencies are unhealthy.

## Configuration

### Default Settings
- **Failure Threshold:** 50% (open once at least half of the calls in the sampling window fail)
- **Minimum Throughput:** 10 requests (the breaker cannot open until the window has seen 10 calls)
- **Break Duration:** 30 seconds (stay open for 30s before moving to half-open)
- **Sampling Duration:** 60 seconds (failure rate is calculated over a 60s window)

### Tuning

**Aggressive (fail fast):**
```json
{
  "FailureThreshold": 0.3,
  "MinimumThroughput": 5,
  "BreakDurationSeconds": 15
}
```

**Conservative (tolerate failures):**
```json
{
  "FailureThreshold": 0.7,
  "MinimumThroughput": 20,
  "BreakDurationSeconds": 60
}
```

## States

### Closed (Normal)
- Requests pass through
- Failures tracked
- Opens if threshold exceeded

### Open (Broken)
- Requests fail immediately
- No dependency calls made
- Transitions to Half-Open after break duration

### Half-Open (Testing)
- Limited requests allowed through
- If successful: Close
- If failed: Open again

## Protected Resources

- MongoDB: All database operations
- Redis: All cache operations
- RabbitMQ: Message publishing

## Monitoring

### Prometheus Metrics

    stargate_circuit_breaker_state{resource_name="MongoDB"}
    stargate_circuit_breaker_opened_total{resource_name="MongoDB"}
    stargate_circuit_breaker_rejected_total{resource_name="MongoDB"}

### API Endpoints

    GET /api/circuit-breakers
    GET /api/circuit-breakers/MongoDB
    POST /api/circuit-breakers/MongoDB/reset

## Troubleshooting

### Circuit Open for MongoDB
1. Check MongoDB health
2. Verify network connectivity
3. Review error logs
4. Wait for automatic recovery or reset manually

### Manual Reset

    curl -X POST http://localhost:5000/api/circuit-breakers/MongoDB/reset

## Best Practices

1. Don't break on validation errors (only infrastructure failures)
2. Set appropriate thresholds for each dependency
3. Monitor circuit breaker metrics in production
4. Have fallback behaviors when circuit opens
5. Test circuit breaker behavior in staging

✅ Acceptance Criteria

- [ ] Polly packages installed
- [ ] Circuit breaker options configured
- [ ] Circuit breaker service implemented
- [ ] MongoDB operations protected
- [ ] Redis operations protected
- [ ] RabbitMQ operations protected
- [ ] Circuit breaker states tracked
- [ ] Manual reset endpoint created
- [ ] Status query endpoints created
- [ ] Prometheus metrics emitted
- [ ] Fallback behaviors implemented
- [ ] Configuration externalized
- [ ] Unit tests for circuit breaker logic
- [ ] Integration tests with broken dependencies
- [ ] Documentation complete
- [ ] Code follows CODING-CONVENTIONS.md

📝 Testing Instructions

```bash
# Start infrastructure
docker-compose up -d

# Run application
dotnet run --project src/StarGate.Server

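# Check circuit breaker status
# (the /api/circuit-breakers group calls RequireAuthorization(); if authentication
# is enabled, add your auth header to these calls or expect a 401)
curl http://localhost:5000/api/circuit-breakers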

# Simulate MongoDB failure
docker stop stargate-mongodb

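# Make requests (each should fail and count toward the failure rate).
# The breaker needs 10 calls inside the 60s sampling window; if every call blocks
# on the MongoDB driver's server selection timeout (30s by default), run these in
# parallel or shorten that timeout, otherwise the circuit may never open.
for i in {1..15}; do
  curl -X POST http://localhost:5000/api/processes \
    -H "Content-Type: application/json" \
    -d '{"clientId": "test", "processType": "order", "clientProcessId": "test-'$i'"}'
done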

# Check circuit breaker status (should be Open)
curl http://localhost:5000/api/circuit-breakers/MongoDB

# Try another request (should fail immediately)
curl -X POST http://localhost:5000/api/processes \
  -H "Content-Type: application/json" \
  -d '{"clientId": "test", "processType": "order", "clientProcessId": "test-fast-fail"}'
# Should return 503 immediately

# Start MongoDB again
docker start stargate-mongodb
sleep 5

# Wait for break duration (30s) or reset manually
curl -X POST http://localhost:5000/api/circuit-breakers/MongoDB/reset

# Check status (should be Closed)
curl http://localhost:5000/api/circuit-breakers/MongoDB

# Try request again (should succeed)
curl -X POST http://localhost:5000/api/processes \
  -H "Content-Type: application/json" \
  -d '{"clientId": "test", "processType": "order", "clientProcessId": "test-recovery"}'

# Check Prometheus metrics
curl http://localhost:5000/metrics | grep circuit_breaker
```

📚 References

🏷️ Labels

phase-4+ production-readiness circuit-breaker resilience polly

⏱️ Estimated Effort

6-8 hours

🔗 Dependencies

🔗 Related Issues

Part of "Production-Ready API" initiative - adds resilience to external dependencies

📌 Important Notes

Advanced vs Basic Circuit Breaker

Advanced (Recommended):

  • Tracks failure rate over time window
  • Requires minimum throughput before breaking
  • More sophisticated than simple count

Basic:

  • Breaks after N consecutive failures
  • Simpler but less flexible
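
The difference is easiest to see with numbers. The sketch below restates the advanced rule as plain arithmetic: the circuit opens only when the sampling window has seen enough calls and the failure ratio reaches the threshold. It is illustrative only and does not reproduce Polly's internals. Treating a ratio exactly equal to the threshold as tripping is an assumption.

```csharp
using System;

// Illustrative only: mirrors the rule described above, not Polly's implementation.
public static class FailureRateRule
{
    // True when the window has enough traffic AND the failure ratio reaches the threshold.
    public static bool ShouldOpen(
        int failures,
        int totalCalls,
        double failureThreshold,
        int minimumThroughput)
    {
        if (totalCalls <= 0 || totalCalls < minimumThroughput)
            return false; // too little traffic in the window to judge

        return (double)failures / totalCalls >= failureThreshold;
    }
}
```

With the default settings (threshold 0.5, minimum throughput 10), 5 failures out of 10 calls would open the circuit. With only 9 calls in the window, even 9 failures would not, because the minimum throughput has not been met.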

Exception Handling

Break on:

  • TimeoutException
  • IOException
  • MongoConnectionException
  • RedisConnectionException

Don't break on:

  • ArgumentException (validation)
  • BusinessLogicException
  • DuplicateKeyException
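
The policy in section 3 counts every exception except argument errors, which is broader than the lists above. A predicate along the following lines would match the lists more closely. This is a sketch restricted to .NET base class library types. `TransientFailureClassifier` is a hypothetical name, and the driver-specific exceptions (`MongoConnectionException`, `RedisConnectionException`) would be added as extra cases in the project that references those packages.

```csharp
using System;
using System.IO;

// Hypothetical classifier; extend it with driver-specific exception types
// where the MongoDB and Redis packages are referenced.
public static class TransientFailureClassifier
{
    public static bool ShouldCountAsFailure(Exception ex) => ex switch
    {
        // Caller bugs and caller-initiated cancellation say nothing about dependency health.
        ArgumentException => false,
        OperationCanceledException => false,

        // Infrastructure-level failures from the "Break on" list.
        TimeoutException => true,
        IOException => true,

        // Anything unrecognised is ignored in this sketch; a real policy may prefer
        // the opposite default.
        _ => false
    };
}
```

It could be supplied to the policy as `Policy.Handle<Exception>(TransientFailureClassifier.ShouldCountAsFailure)`.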

Metrics Collection

Track:

  • Circuit state (Closed=0, Open=1, HalfOpen=2)
  • Times opened (counter)
  • Requests rejected (counter)
  • Mean time between failures

Fallback Strategies

MongoDB failure:

  • Return cached data (if available)
  • Return 503 Service Unavailable
  • Queue for retry later

Redis failure:

  • Skip caching
  • Fetch from source
  • Degrade gracefully

RabbitMQ failure:

  • Store in database fallback queue
  • Retry later via background job
  • Alert operations team
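
As an illustration of the Redis strategy (skip caching, fetch from source), the sketch below wraps a cache read so that an unavailable cache degrades to a direct source read instead of failing the request. `CacheFallback` and its parameters are hypothetical. The `isCacheUnavailable` predicate is where a caller would recognise `BrokenCircuitException` or a Redis connection error.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical cache-aside helper: a cache failure degrades to a source read.
public static class CacheFallback
{
    public static async Task<T> GetAsync<T>(
        Func<Task<T?>> readCache,
        Func<Task<T>> readSource,
        Func<Exception, bool> isCacheUnavailable) where T : class
    {
        try
        {
            var cached = await readCache();
            if (cached is not null)
                return cached; // cache hit
        }
        catch (Exception ex) when (isCacheUnavailable(ex))
        {
            // Cache is down or its circuit is open: skip it and fall through to the source.
        }

        // Cache miss or cache unavailable: read from the source of truth.
        return await readSource();
    }
}
```

Exceptions that the predicate does not recognise still propagate, so genuine bugs in the cache path are not silently hidden.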
