📋 Task Description
Implement circuit breakers using Polly to protect against cascading failures when external dependencies (MongoDB, Redis, RabbitMQ) become unavailable. Configure circuit breakers with appropriate thresholds, break durations, and fallback behaviors.
🎯 Objectives
- Install Polly NuGet packages
- Implement circuit breaker for MongoDB operations
- Implement circuit breaker for Redis operations
- Implement circuit breaker for RabbitMQ operations
- Configure circuit breaker thresholds (failure %, count)
- Configure break duration and half-open retries
- Add circuit breaker state monitoring
- Implement fallback behaviors
- Add circuit breaker metrics for Prometheus
- Create manual circuit breaker control endpoints
- Write unit tests for circuit breaker behavior
- Document circuit breaker configuration
📦 Deliverables
1. Install Polly Packages
Update src/StarGate.Infrastructure/StarGate.Infrastructure.csproj:
<ItemGroup>
<PackageReference Include="Polly" Version="8.3.1" />
<PackageReference Include="Polly.Extensions.Http" Version="3.0.0" />
<PackageReference Include="Polly.Contrib.WaitAndRetry" Version="1.1.1" />
</ItemGroup>
2. Create Circuit Breaker Configuration
Create src/StarGate.Infrastructure/Resilience/CircuitBreakerOptions.cs:
namespace StarGate.Infrastructure.Resilience;
public class CircuitBreakerOptions
{
public const string SectionName = "Resilience:CircuitBreaker";
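/// <summary>
/// Fraction of calls that must fail within the sampling window before the circuit breaks
/// (greater than 0.0 and at most 1.0; 0.5 means 50%)
/// </summary>
public double FailureThreshold { get; set; } = 0.5;
/// <summary>
/// Minimum number of calls within the sampling window before the failure rate is evaluated
/// </summary>
public int MinimumThroughput { get; set; } = 10;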
/// <summary>
/// Duration to keep circuit open (seconds)
/// </summary>
public int BreakDurationSeconds { get; set; } = 30;
/// <summary>
/// Sampling duration for failure rate calculation (seconds)
/// </summary>
public int SamplingDurationSeconds { get; set; } = 60;
/// <summary>
/// Enable circuit breaker
/// </summary>
public bool Enabled { get; set; } = true;
}
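Because these values come from configuration, a startup check can surface a bad setting before the first policy is built. The following is a minimal sketch, not part of the planned deliverables: `CircuitBreakerOptionsValidator` is a hypothetical helper, and the limits it enforces (threshold above 0 and at most 1, throughput of at least 2) are assumptions based on the argument checks in Polly's v7-style `AdvancedCircuitBreakerAsync`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Polly or the existing code base.
public static class CircuitBreakerOptionsValidator
{
    // Returns human-readable problems; an empty list means the values look usable.
    public static IReadOnlyList<string> Validate(
        double failureThreshold,
        int minimumThroughput,
        int breakDurationSeconds,
        int samplingDurationSeconds)
    {
        var errors = new List<string>();

        // Written as a negated range check so that double.NaN is rejected too.
        if (!(failureThreshold > 0.0 && failureThreshold <= 1.0))
            errors.Add("FailureThreshold must be greater than 0.0 and at most 1.0.");

        // Assumption: the advanced breaker needs more than one call to compute a rate.
        if (minimumThroughput < 2)
            errors.Add("MinimumThroughput must be at least 2.");

        if (breakDurationSeconds <= 0)
            errors.Add("BreakDurationSeconds must be positive.");

        if (samplingDurationSeconds <= 0)
            errors.Add("SamplingDurationSeconds must be positive.");

        return errors;
    }
}
```

One way to use it is to call `Validate` with the bound option values during startup and throw if the returned list is not empty.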
3. Create Circuit Breaker Service
Create src/StarGate.Infrastructure/Resilience/CircuitBreakerService.cs:
namespace StarGate.Infrastructure.Resilience;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Polly;
using Polly.CircuitBreaker;
using System.Collections.Concurrent;
public interface ICircuitBreakerService
{
AsyncCircuitBreakerPolicy GetOrCreatePolicy(string resourceName);
CircuitState GetState(string resourceName);
void Reset(string resourceName);
Dictionary<string, CircuitBreakerStatus> GetAllStatuses();
}
public class CircuitBreakerService : ICircuitBreakerService
{
private readonly ConcurrentDictionary<string, AsyncCircuitBreakerPolicy> _policies;
private readonly CircuitBreakerOptions _options;
private readonly ILogger<CircuitBreakerService> _logger;
public CircuitBreakerService(
IOptions<CircuitBreakerOptions> options,
ILogger<CircuitBreakerService> logger)
{
_options = options?.Value ?? throw new ArgumentNullException(nameof(options));
_logger = logger ?? throw new ArgumentNullException(nameof(logger));
_policies = new ConcurrentDictionary<string, AsyncCircuitBreakerPolicy>();
}
public AsyncCircuitBreakerPolicy GetOrCreatePolicy(string resourceName)
{
return _policies.GetOrAdd(resourceName, CreatePolicy);
}
private AsyncCircuitBreakerPolicy CreatePolicy(string resourceName)
{
if (!_options.Enabled)
{
_logger.LogInformation(
"Circuit breaker disabled for {ResourceName}",
resourceName);
return Policy
.Handle<Exception>()
.CircuitBreakerAsync(
exceptionsAllowedBeforeBreaking: int.MaxValue,
durationOfBreak: TimeSpan.Zero);
}
var policy = Policy
.Handle<Exception>(ex =>
{
// Don't break for validation errors.
// ArgumentNullException derives from ArgumentException, so one check covers both.
return ex is not ArgumentException;
})
.AdvancedCircuitBreakerAsync(
failureThreshold: _options.FailureThreshold,
samplingDuration: TimeSpan.FromSeconds(_options.SamplingDurationSeconds),
minimumThroughput: _options.MinimumThroughput,
durationOfBreak: TimeSpan.FromSeconds(_options.BreakDurationSeconds),
onBreak: (exception, duration) =>
{
_logger.LogWarning(
exception,
"Circuit breaker opened for {ResourceName}. Duration: {Duration}s",
resourceName,
duration.TotalSeconds);
},
onReset: () =>
{
_logger.LogInformation(
"Circuit breaker closed for {ResourceName}",
resourceName);
},
onHalfOpen: () =>
{
_logger.LogInformation(
"Circuit breaker half-open for {ResourceName}",
resourceName);
});
_logger.LogInformation(
"Created circuit breaker for {ResourceName} with threshold {Threshold}% over {SamplingDuration}s",
resourceName,
_options.FailureThreshold * 100,
_options.SamplingDurationSeconds);
return policy;
}
public CircuitState GetState(string resourceName)
{
if (_policies.TryGetValue(resourceName, out var policy))
{
return policy.CircuitState;
}
return CircuitState.Closed;
}
public void Reset(string resourceName)
{
if (_policies.TryGetValue(resourceName, out var policy))
{
policy.Reset();
_logger.LogInformation(
"Circuit breaker manually reset for {ResourceName}",
resourceName);
}
}
public Dictionary<string, CircuitBreakerStatus> GetAllStatuses()
{
var statuses = new Dictionary<string, CircuitBreakerStatus>();
foreach (var kvp in _policies)
{
statuses[kvp.Key] = new CircuitBreakerStatus
{
ResourceName = kvp.Key,
State = kvp.Value.CircuitState.ToString(),
IsOpen = kvp.Value.CircuitState == CircuitState.Open,
IsHalfOpen = kvp.Value.CircuitState == CircuitState.HalfOpen
};
}
return statuses;
}
}
public class CircuitBreakerStatus
{
public required string ResourceName { get; init; }
public required string State { get; init; }
public required bool IsOpen { get; init; }
public required bool IsHalfOpen { get; init; }
}
4. Integrate Circuit Breakers into Repositories
Update src/StarGate.Infrastructure/Persistence/MongoDB/MongoProcessRepository.cs:
public class MongoProcessRepository : IProcessRepository
{
private readonly IMongoCollection<Process> _collection;
private readonly ICircuitBreakerService _circuitBreakerService;
private readonly ILogger<MongoProcessRepository> _logger;
private readonly AsyncCircuitBreakerPolicy _circuitBreaker;
public MongoProcessRepository(
IMongoDatabase database,
ICircuitBreakerService circuitBreakerService,
ILogger<MongoProcessRepository> logger)
{
_collection = database.GetCollection<Process>("processes");
_circuitBreakerService = circuitBreakerService;
_logger = logger;
_circuitBreaker = circuitBreakerService.GetOrCreatePolicy("MongoDB");
}
public async Task<Process?> GetByIdAsync(
Guid processId,
CancellationToken cancellationToken = default)
{
try
{
return await _circuitBreaker.ExecuteAsync(async (ct) =>
{
var filter = Builders<Process>.Filter.Eq(p => p.ProcessId, processId);
return await _collection.Find(filter).FirstOrDefaultAsync(ct);
}, cancellationToken);
}
catch (BrokenCircuitException ex)
{
_logger.LogWarning(
ex,
"Circuit breaker open for MongoDB - GetByIdAsync failed fast");
throw new ServiceUnavailableException("Database temporarily unavailable", ex);
}
}
public async Task CreateAsync(
Process process,
CancellationToken cancellationToken = default)
{
try
{
await _circuitBreaker.ExecuteAsync(async (ct) =>
{
await _collection.InsertOneAsync(process, cancellationToken: ct);
}, cancellationToken);
}
catch (BrokenCircuitException ex)
{
_logger.LogWarning(
ex,
"Circuit breaker open for MongoDB - CreateAsync failed fast");
throw new ServiceUnavailableException("Database temporarily unavailable", ex);
}
}
// Apply to all methods...
}
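The repository above throws `ServiceUnavailableException`, which this issue does not define, and the testing instructions expect an HTTP 503 when the circuit is open. A minimal sketch of both pieces follows. The exception shape, the `UseServiceUnavailableMapping` name, and the `Retry-After` value are assumptions, not existing StarGate code.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

// Assumed shape of the exception thrown by the repository; the real type may differ.
public class ServiceUnavailableException : Exception
{
    public ServiceUnavailableException(string message, Exception? innerException = null)
        : base(message, innerException)
    {
    }
}

public static class ServiceUnavailableMiddlewareExtensions
{
    // Hypothetical helper: converts ServiceUnavailableException into an HTTP 503 response.
    public static IApplicationBuilder UseServiceUnavailableMapping(this IApplicationBuilder app)
    {
        return app.Use(async (HttpContext context, RequestDelegate next) =>
        {
            try
            {
                await next(context);
            }
            catch (ServiceUnavailableException ex) when (!context.Response.HasStarted)
            {
                context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
                // Hint for clients; 30 mirrors the default BreakDurationSeconds.
                context.Response.Headers["Retry-After"] = "30";
                await context.Response.WriteAsJsonAsync(new { error = ex.Message });
            }
        });
    }
}
```

If this approach is adopted, `app.UseServiceUnavailableMapping();` would need to be registered in `Program.cs` before the endpoints are mapped so that it wraps them.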
5. Create Circuit Breaker Control Endpoints
Create src/StarGate.Server/Endpoints/CircuitBreakerEndpoints.cs:
namespace StarGate.Server.Endpoints;
using Microsoft.AspNetCore.Mvc;
using StarGate.Infrastructure.Resilience;
public static class CircuitBreakerEndpoints
{
public static void MapCircuitBreakerEndpoints(this IEndpointRouteBuilder app)
{
var group = app.MapGroup("/api/circuit-breakers")
.WithTags("Circuit Breakers")
.RequireAuthorization(); // Protect admin endpoints
// Get all circuit breaker statuses
group.MapGet("/", (
[FromServices] ICircuitBreakerService service) =>
{
var statuses = service.GetAllStatuses();
return Results.Ok(statuses);
})
.WithName("GetCircuitBreakerStatuses")
.WithOpenApi();
// Get specific circuit breaker status
group.MapGet("/{resourceName}", (
string resourceName,
[FromServices] ICircuitBreakerService service) =>
{
var state = service.GetState(resourceName);
return Results.Ok(new
{
resourceName,
state = state.ToString(),
isOpen = state == Polly.CircuitBreaker.CircuitState.Open
});
})
.WithName("GetCircuitBreakerStatus")
.WithOpenApi();
// Reset circuit breaker
group.MapPost("/{resourceName}/reset", (
string resourceName,
[FromServices] ICircuitBreakerService service) =>
{
service.Reset(resourceName);
return Results.Ok(new
{
message = $"Circuit breaker reset for {resourceName}"
});
})
.WithName("ResetCircuitBreaker")
.WithOpenApi();
}
}
Register in Program.cs:
app.MapCircuitBreakerEndpoints();
6. Add Circuit Breaker Metrics
Update src/StarGate.Core/Metrics/ApplicationMetrics.cs:
public static readonly Gauge CircuitBreakerState = Metrics.CreateGauge(
"stargate_circuit_breaker_state",
"Circuit breaker state (0=Closed, 1=Open, 2=HalfOpen)",
new GaugeConfiguration
{
LabelNames = new[] { "resource_name" }
});
public static readonly Counter CircuitBreakerOpened = Metrics.CreateCounter(
"stargate_circuit_breaker_opened_total",
"Total number of times circuit breaker opened",
new CounterConfiguration
{
LabelNames = new[] { "resource_name" }
});
public static readonly Counter CircuitBreakerRejected = Metrics.CreateCounter(
"stargate_circuit_breaker_rejected_total",
"Total number of requests rejected by circuit breaker",
new CounterConfiguration
{
LabelNames = new[] { "resource_name" }
});
Update circuit breaker to emit metrics:
onBreak: (exception, duration) =>
{
_logger.LogWarning(...);
ApplicationMetrics.CircuitBreakerOpened
.WithLabels(resourceName)
.Inc();
ApplicationMetrics.CircuitBreakerState
.WithLabels(resourceName)
.Set(1); // Open
},
onReset: () =>
{
_logger.LogInformation(...);
ApplicationMetrics.CircuitBreakerState
.WithLabels(resourceName)
.Set(0); // Closed
},
onHalfOpen: () =>
{
_logger.LogInformation(...);
ApplicationMetrics.CircuitBreakerState
.WithLabels(resourceName)
.Set(2); // HalfOpen
}
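The callbacks above update the state gauge and the opened counter, but nothing increments `stargate_circuit_breaker_rejected_total`. One possible approach is a small wrapper around `ExecuteAsync` that counts every `BrokenCircuitException` before rethrowing it. This is a sketch only: `ExecuteCountingRejectionsAsync` is a hypothetical extension method, and the counter is passed in as a parameter so the snippet does not depend on `ApplicationMetrics`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly.CircuitBreaker;
using Prometheus;

// Hypothetical extension, not part of Polly.
public static class CircuitBreakerPolicyExtensions
{
    // Runs the action through the breaker and counts calls rejected by an open circuit.
    public static async Task<T> ExecuteCountingRejectionsAsync<T>(
        this AsyncCircuitBreakerPolicy policy,
        Counter rejectedCounter,
        string resourceName,
        Func<CancellationToken, Task<T>> action,
        CancellationToken cancellationToken = default)
    {
        try
        {
            return await policy.ExecuteAsync<T>(action, cancellationToken);
        }
        catch (BrokenCircuitException)
        {
            // The delegate was never invoked: the breaker rejected the call.
            rejectedCounter.WithLabels(resourceName).Inc();
            throw;
        }
    }
}
```

A repository could then call `_circuitBreaker.ExecuteCountingRejectionsAsync(ApplicationMetrics.CircuitBreakerRejected, "MongoDB", ct => ..., cancellationToken)` in place of `ExecuteAsync`, keeping its existing `catch (BrokenCircuitException)` block.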
7. Add Configuration
Update src/StarGate.Server/appsettings.json:
{
"Resilience": {
"CircuitBreaker": {
"Enabled": true,
"FailureThreshold": 0.5,
"MinimumThroughput": 10,
"BreakDurationSeconds": 30,
"SamplingDurationSeconds": 60
}
}
}
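The configuration section only takes effect once it is bound to `CircuitBreakerOptions` and the service is registered in the DI container. A minimal sketch of that wiring follows. The `AddCircuitBreakers` extension name is an assumption, and the singleton lifetime is chosen so that all consumers share one policy (and therefore one circuit state) per resource name.

```csharp
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using StarGate.Infrastructure.Resilience;

// Hypothetical extension method; the name AddCircuitBreakers is an assumption.
public static class ResilienceServiceCollectionExtensions
{
    public static IServiceCollection AddCircuitBreakers(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        // Bind the "Resilience:CircuitBreaker" section to CircuitBreakerOptions.
        services.Configure<CircuitBreakerOptions>(
            configuration.GetSection(CircuitBreakerOptions.SectionName));

        // Singleton: the policy cache inside the service must outlive individual requests,
        // otherwise every request would start with a fresh, closed circuit.
        services.AddSingleton<ICircuitBreakerService, CircuitBreakerService>();

        return services;
    }
}
```

In `Program.cs` this would be called as `builder.Services.AddCircuitBreakers(builder.Configuration);` before `builder.Build()`.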
8. Create Documentation
Create docs/CIRCUIT-BREAKERS.md:
# Circuit Breakers - StarGate
## Overview
Circuit breakers protect against cascading failures by failing fast when dependencies are unhealthy.
## Configuration
### Default Settings
- **Failure Threshold:** 50% (open after 50% failures)
- **Minimum Throughput:** 10 requests (need 10 requests before breaking)
- **Break Duration:** 30 seconds (stay open for 30s)
- **Sampling Duration:** 60 seconds (calculate failure rate over 60s)
### Tuning
**Aggressive (fail fast):**
```json
{
  "FailureThreshold": 0.3,
  "MinimumThroughput": 5,
  "BreakDurationSeconds": 15
}
```
**Conservative (tolerate failures):**
```json
{
  "FailureThreshold": 0.7,
  "MinimumThroughput": 20,
  "BreakDurationSeconds": 60
}
```
## States
### Closed (Normal)
- Requests pass through
- Failures tracked
- Opens if threshold exceeded
### Open (Broken)
- Requests fail immediately
- No dependency calls made
- Transitions to Half-Open after break duration
### Half-Open (Testing)
- Limited requests allowed through
- If successful: Close
- If failed: Open again
## Protected Resources
- MongoDB: All database operations
- Redis: All cache operations
- RabbitMQ: Message publishing
## Monitoring
### Prometheus Metrics
stargate_circuit_breaker_state{resource_name="MongoDB"}
stargate_circuit_breaker_opened_total{resource_name="MongoDB"}
stargate_circuit_breaker_rejected_total{resource_name="MongoDB"}
### API Endpoints
GET /api/circuit-breakers
GET /api/circuit-breakers/MongoDB
POST /api/circuit-breakers/MongoDB/reset
## Troubleshooting
### Circuit Open for MongoDB
- Check MongoDB health
- Verify network connectivity
- Review error logs
- Wait for automatic recovery or reset manually
### Manual Reset
curl -X POST http://localhost:5000/api/circuit-breakers/MongoDB/reset
## Best Practices
- Don't break on validation errors (only infrastructure failures)
- Set appropriate thresholds for each dependency
- Monitor circuit breaker metrics in production
- Have fallback behaviors when circuit opens
- Test circuit breaker behavior in staging
## ✅ Acceptance Criteria
- [ ] Polly packages installed
- [ ] Circuit breaker options configured
- [ ] Circuit breaker service implemented
- [ ] MongoDB operations protected
- [ ] Redis operations protected
- [ ] RabbitMQ operations protected
- [ ] Circuit breaker states tracked
- [ ] Manual reset endpoint created
- [ ] Status query endpoints created
- [ ] Prometheus metrics emitted
- [ ] Fallback behaviors implemented
- [ ] Configuration externalized
- [ ] Unit tests for circuit breaker logic
- [ ] Integration tests with broken dependencies
- [ ] Documentation complete
- [ ] Code follows CODING-CONVENTIONS.md
## 📝 Testing Instructions
```bash
# Start infrastructure
docker-compose up -d
# Run application
dotnet run --project src/StarGate.Server
# Check circuit breaker status
curl http://localhost:5000/api/circuit-breakers
# Simulate MongoDB failure
docker stop stargate-mongodb
# Make requests (will fail and increment failure count)
for i in {1..15}; do
curl -X POST http://localhost:5000/api/processes \
-H "Content-Type: application/json" \
-d '{"clientId": "test", "processType": "order", "clientProcessId": "test-'$i'"}'
done
# Check circuit breaker status (should be Open)
curl http://localhost:5000/api/circuit-breakers/MongoDB
# Try another request (should fail immediately)
curl -X POST http://localhost:5000/api/processes \
-H "Content-Type: application/json" \
-d '{"clientId": "test", "processType": "order", "clientProcessId": "test-fast-fail"}'
# Should return 503 immediately
# Start MongoDB again
docker start stargate-mongodb
sleep 5
# Wait for break duration (30s) or reset manually
curl -X POST http://localhost:5000/api/circuit-breakers/MongoDB/reset
# Check status (should be Closed)
curl http://localhost:5000/api/circuit-breakers/MongoDB
# Try request again (should succeed)
curl -X POST http://localhost:5000/api/processes \
-H "Content-Type: application/json" \
-d '{"clientId": "test", "processType": "order", "clientProcessId": "test-recovery"}'
# Check Prometheus metrics
curl http://localhost:5000/metrics | grep circuit_breaker
```
📚 References
🏷️ Labels
phase-4+ production-readiness circuit-breaker resilience polly
⏱️ Estimated Effort
6-8 hours
🔗 Dependencies
🔗 Related Issues
Part of "Production-Ready API" initiative - adds resilience to external dependencies
📌 Important Notes
Advanced vs Basic Circuit Breaker
Advanced (Recommended):
- Tracks failure rate over time window
- Requires minimum throughput before breaking
- More sophisticated than simple count
Basic:
- Breaks after N consecutive failures
- Simpler but less flexible
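The difference between the two breaker types can be pinned down in unit tests, which also covers part of the "unit tests for circuit breaker behavior" objective. The sketch below assumes xUnit as the test framework (the issue does not name one) and exercises Polly's v7-style policies directly rather than `CircuitBreakerService`. The test class and method names are illustrative.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Xunit;

public class CircuitBreakerBehaviorTests
{
    // Simulates a dependency call that always fails.
    private static Task FailingCall() =>
        Task.FromException(new InvalidOperationException("dependency down"));

    [Fact]
    public async Task Basic_breaker_opens_after_consecutive_failures()
    {
        var breaker = Policy
            .Handle<InvalidOperationException>()
            .CircuitBreakerAsync(
                exceptionsAllowedBeforeBreaking: 2,
                durationOfBreak: TimeSpan.FromSeconds(30));

        // The first two calls reach the dependency and surface its exception.
        await Assert.ThrowsAsync<InvalidOperationException>(() => breaker.ExecuteAsync(FailingCall));
        await Assert.ThrowsAsync<InvalidOperationException>(() => breaker.ExecuteAsync(FailingCall));
        Assert.Equal(CircuitState.Open, breaker.CircuitState);

        // While open, calls are rejected without invoking the delegate.
        await Assert.ThrowsAsync<BrokenCircuitException>(() => breaker.ExecuteAsync(FailingCall));

        // A manual reset closes the circuit again.
        breaker.Reset();
        Assert.Equal(CircuitState.Closed, breaker.CircuitState);
    }

    [Fact]
    public async Task Advanced_breaker_waits_for_minimum_throughput()
    {
        var breaker = Policy
            .Handle<InvalidOperationException>()
            .AdvancedCircuitBreakerAsync(
                failureThreshold: 0.5,
                samplingDuration: TimeSpan.FromSeconds(60),
                minimumThroughput: 3,
                durationOfBreak: TimeSpan.FromSeconds(30));

        // Two failures give a 100% failure rate, but throughput is still below 3.
        await Assert.ThrowsAsync<InvalidOperationException>(() => breaker.ExecuteAsync(FailingCall));
        await Assert.ThrowsAsync<InvalidOperationException>(() => breaker.ExecuteAsync(FailingCall));
        Assert.Equal(CircuitState.Closed, breaker.CircuitState);

        // The third failure reaches the minimum throughput, so the circuit opens.
        await Assert.ThrowsAsync<InvalidOperationException>(() => breaker.ExecuteAsync(FailingCall));
        Assert.Equal(CircuitState.Open, breaker.CircuitState);
    }
}
```

Both tests finish well inside a single sampling window, so they do not depend on timing.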
Exception Handling
Break on:
- TimeoutException
- IOException
- MongoConnectionException
- RedisConnectionException
Don't break on:
- ArgumentException (validation)
- BusinessLogicException
- DuplicateKeyException
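The "break on" list above can be expressed as an allow-list of handled exception types, which is stricter than the `Handle<Exception>` filter used in `CircuitBreakerService`. A minimal sketch, assuming the MongoDB.Driver and StackExchange.Redis packages are referenced. `TransientFailurePolicies` is a hypothetical name. Exceptions outside the list pass through to the caller without counting toward the failure rate.

```csharp
using System;
using System.IO;
using MongoDB.Driver;
using Polly;
using StackExchange.Redis;

// Hypothetical factory, not part of the existing code base.
public static class TransientFailurePolicies
{
    // Builds a policy builder that treats only infrastructure failures as breaker failures.
    public static PolicyBuilder HandleInfrastructureFailures()
    {
        return Policy
            .Handle<TimeoutException>()
            .Or<IOException>()
            .Or<MongoConnectionException>()
            .Or<RedisConnectionException>();
    }
}
```

The returned builder can be chained with `.AdvancedCircuitBreakerAsync(...)` in the same way `CreatePolicy` does today.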
Metrics Collection
Track:
- Circuit state (Closed=0, Open=1, HalfOpen=2)
- Times opened (counter)
- Requests rejected (counter)
- Mean time between failures
Fallback Strategies
MongoDB failure:
- Return cached data (if available)
- Return 503 Service Unavailable
- Queue for retry later
Redis failure:
- Skip caching
- Fetch from source
- Degrade gracefully
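The Redis strategy above (skip caching, fetch from source) might look like the following cache-aside reader. This is a sketch under assumptions: `ICacheStore` and `ResilientCacheReader` are hypothetical names standing in for whatever Redis wrapper StarGate uses, and values are plain strings to keep the example short.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly.CircuitBreaker;

// Hypothetical cache abstraction; the real Redis wrapper may look different.
public interface ICacheStore
{
    Task<string?> GetAsync(string key, CancellationToken cancellationToken);
}

public class ResilientCacheReader
{
    private readonly ICacheStore _cache;
    private readonly AsyncCircuitBreakerPolicy _circuitBreaker;

    public ResilientCacheReader(ICacheStore cache, AsyncCircuitBreakerPolicy circuitBreaker)
    {
        _cache = cache ?? throw new ArgumentNullException(nameof(cache));
        _circuitBreaker = circuitBreaker ?? throw new ArgumentNullException(nameof(circuitBreaker));
    }

    // Tries the cache through the breaker; on any cache problem, falls back to the source.
    public async Task<string?> GetOrLoadAsync(
        string key,
        Func<CancellationToken, Task<string?>> loadFromSource,
        CancellationToken cancellationToken = default)
    {
        try
        {
            var cached = await _circuitBreaker.ExecuteAsync<string?>(
                ct => _cache.GetAsync(key, ct), cancellationToken);
            if (cached is not null)
            {
                return cached;
            }
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open: skip the cache instead of waiting on Redis.
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // The breaker has already recorded this failure; degrade to the source.
        }

        return await loadFromSource(cancellationToken);
    }
}
```

The policy passed to the constructor would typically come from `circuitBreakerService.GetOrCreatePolicy("Redis")`. A cache miss and a cache outage both end in a call to `loadFromSource`, so callers see higher latency rather than an error.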
RabbitMQ failure:
- Store in database fallback queue
- Retry later via background job
- Alert operations team