Merged
92 commits
2e5cf3b
feat: Add ProcessContext domain model for handler execution context
artcava Mar 2, 2026
1481080
feat: Add Metadata property to ProcessMessage
artcava Mar 2, 2026
87c4826
feat: Implement ProcessWorker background service
artcava Mar 2, 2026
82aa867
fix: Remove Metadata from FromProcess method
artcava Mar 2, 2026
1fba1dd
fix some errors
Mar 2, 2026
48bb2dd
fix: Use correct MessageContext method names
artcava Mar 2, 2026
d500691
fix: Update ProcessWorker tests to match actual constructor
artcava Mar 2, 2026
c5d415d
Merge pull request #142 from artcava/feature/99-process-worker-messag…
artcava Mar 2, 2026
0ab64f7
feat: add graceful shutdown handling to ProcessWorker
artcava Mar 2, 2026
2765b40
feat: add health check for ProcessWorker
artcava Mar 2, 2026
c6ceb07
feat: configure graceful shutdown and health checks in Program.cs
artcava Mar 2, 2026
7e9e47c
test: add unit tests for ProcessWorker graceful shutdown
artcava Mar 2, 2026
d63f2b7
test: add unit tests for ProcessWorkerHealthCheck
artcava Mar 2, 2026
3930910
docs: add graceful shutdown documentation and testing guide
artcava Mar 2, 2026
108c62c
fix: correct shutdown timeout configuration for HostApplicationBuilder
artcava Mar 2, 2026
ec2c3c2
fix: remove invalid BeDefined() assertions in tests
artcava Mar 2, 2026
c96df41
Merge pull request #143 from artcava/feature/issue-100-graceful-shutdown
artcava Mar 2, 2026
5c6e3a0
feat: add timeout enforcement infrastructure (#101)
artcava Mar 2, 2026
463e0d8
feat: implement timeout enforcement in ProcessWorker (#101)
artcava Mar 2, 2026
038181b
feat: register TimeoutScannerWorker and add comprehensive tests (#101)
artcava Mar 2, 2026
931c7cf
fix: implement GetTimedOutProcessesAsync in InMemoryProcessRepository…
artcava Mar 2, 2026
cb9105d
fix: refactor integration tests to use correct base fixture (#101)
artcava Mar 2, 2026
1f400ee
feat: add MongoRepositoryTestBase for repository integration tests
artcava Mar 2, 2026
6a7007d
refactor: migrate MongoProcessRepositoryTimeoutTests to use MongoRepo…
artcava Mar 2, 2026
f0738c8
test: add unit tests for TimeoutScannerWorker
artcava Mar 2, 2026
3a3d5ab
test: add timeout enforcement tests for ProcessWorker
artcava Mar 2, 2026
34ab67c
docs: add comprehensive timeout enforcement documentation
artcava Mar 2, 2026
90a15ba
fix: correct return type in TimeoutScannerWorkerTests mocks
artcava Mar 2, 2026
6b257a7
fix: correct return type in ProcessWorkerTimeoutTests mocks
artcava Mar 2, 2026
83ea91d
fix: use ReturnsAsync() for void async methods in ProcessWorkerTimeou…
artcava Mar 2, 2026
c3ca1a5
fix: use ReturnsAsync() for void async methods in TimeoutScannerWorke…
artcava Mar 2, 2026
e332911
fix: use Task.FromResult for void async method mocks
artcava Mar 2, 2026
0269505
fix: use Task.FromResult for void async method mocks in TimeoutScanne…
artcava Mar 2, 2026
6b128d4
fix: correct mock returns for Task<Process> methods
artcava Mar 2, 2026
23ad83b
fix: correct mock returns for Task<Process> in TimeoutScannerWorkerTests
artcava Mar 2, 2026
43fc053
fix: correct handler mock callback signature
artcava Mar 2, 2026
a46a0c8
fix: use async exception for scanner retry test
artcava Mar 2, 2026
53aa1d5
fix: correct ExecuteAsync mock to return Task<object>
artcava Mar 2, 2026
fe5974a
fix: suppress nullable warning and remove unnecessary async
artcava Mar 2, 2026
beb3847
fix: expect TaskCanceledException instead of OperationCanceledException
artcava Mar 2, 2026
0e9bbec
fix: increase wait time for scanner retry test
artcava Mar 2, 2026
ed7e787
Merge pull request #144 from artcava/feature/issue-101-timeout-enforc…
artcava Mar 2, 2026
80c8265
Update technical analysis
Mar 2, 2026
f6dd07d
feat: add RetryConfiguration with exponential backoff #102
artcava Mar 2, 2026
60d37a7
test: add unit tests for RetryConfiguration #102
artcava Mar 2, 2026
0db33b7
feat: integrate retry logic with exponential backoff in ProcessWorker…
artcava Mar 2, 2026
d7a2e8c
feat: add retry configuration and update Program.cs #102
artcava Mar 2, 2026
fc30b9b
feat: register RetryConfiguration in DI container #102
artcava Mar 2, 2026
608a051
docs: add comprehensive retry logic documentation #102
artcava Mar 2, 2026
95b308d
fix: update ProcessWorkerTests to include retry dependencies #102
artcava Mar 2, 2026
1104015
fix: update all test files to include retry dependencies #102
artcava Mar 2, 2026
020dc60
fix some errors
Mar 2, 2026
900581a
fix: correct jitter calculation to maintain ±30% range #102
artcava Mar 2, 2026
29f3ee3
fix: apply MaxDelay cap after jitter calculation #102
artcava Mar 2, 2026
0bf0eb5
Merge pull request #145 from artcava/feature/issue-102-retry-logic-in…
artcava Mar 2, 2026
c65029e
update Technical Analysis
Mar 2, 2026
46cc77a
feat: implement error classification system (#103)
artcava Mar 2, 2026
dd2a6d3
Fix some errors
Mar 2, 2026
7574fcf
feat: implement DLX configuration and poison message detection (#103)
artcava Mar 2, 2026
0205a93
feat: integrate ErrorClassifier in ProcessWorker (#103)
artcava Mar 2, 2026
0aab1bf
Merge pull request #146 from artcava/feature/103-error-handling-ackno…
artcava Mar 2, 2026
54ed89e
update Technical Analysis
Mar 2, 2026
003272b
feat: implement IProcessHandler and IProcessHandlerFactory interfaces
artcava Mar 2, 2026
025ba37
feat: implement ProcessHandlerFactory with thread-safe registration
artcava Mar 2, 2026
e8c2596
feat: add DI extension methods for process handler registration
artcava Mar 2, 2026
ba8b21c
test: add comprehensive unit tests for ProcessHandlerFactory
artcava Mar 2, 2026
cb3bf39
fix some errors
Mar 2, 2026
9077646
fix: remove references to unimplemented handlers in extension methods
artcava Mar 2, 2026
40084e2
fix: remove ValidationResultTests referencing unimplemented types
artcava Mar 2, 2026
c7a1189
fix some errors
Mar 2, 2026
4563d97
fix: correct ProcessWorker to use IProcessHandlerFactory properly
artcava Mar 2, 2026
18f8f7b
fix: remove reference to non-existent Metadata property in Process
artcava Mar 2, 2026
7a4cb09
fix: update ProcessWorkerTimeoutTests to match new IProcessHandlerFac…
artcava Mar 2, 2026
c50f1e3
fix some errors
Mar 2, 2026
abeca0d
remove some unused usings
Mar 2, 2026
e30acbc
Merge pull request #147 from artcava/feature/104-process-handler-factory
artcava Mar 2, 2026
470b17d
remove async
Mar 2, 2026
d411e0b
feat: implement OrderProcessHandler with multi-step order processing
artcava Mar 2, 2026
5511f7c
feat: register OrderProcessHandler in DI container
artcava Mar 2, 2026
86b8223
Fix some errors
Mar 2, 2026
5c6d638
Merge pull request #148 from artcava/feature/105-implement-order-proc…
artcava Mar 2, 2026
cc755fe
feat: implement ShippingProcessHandler for shipping operations
artcava Mar 2, 2026
fb1bef7
test: add comprehensive unit tests for ShippingProcessHandler
artcava Mar 2, 2026
d924f99
docs: add shipping process API examples and documentation
artcava Mar 2, 2026
4f642c1
feat: register ShippingProcessHandler in DI container
artcava Mar 2, 2026
26ea473
Merge pull request #149 from artcava/feature/106-shipping-process-han…
artcava Mar 2, 2026
f837969
update technical analysis
Mar 2, 2026
a6f56e9
complete updates to technical analysis
Mar 2, 2026
7327a73
fix: eliminate duplicate CI checks by removing push trigger for develop
artcava Mar 2, 2026
4e1bfac
fix: eliminate flaky tests by making random behavior deterministic in…
artcava Mar 2, 2026
44500b6
fix: use deterministic seed in ShippingProcessHandler tests
artcava Mar 2, 2026
af605d0
Merge pull request #151 from artcava/fix/ci-duplicates-and-flaky-tests
artcava Mar 2, 2026
11 changes: 7 additions & 4 deletions .github/workflows/ci.yml
@@ -1,13 +1,16 @@
 name: ci
 
-# Trigger events as per Git Flow documentation
+# Trigger events optimized to avoid duplicate runs
+# - pull_request: Run checks on all PRs (primary validation)
+# - push to main: Run checks after merge (final validation)
+# - push tags: Trigger release workflow
+# Note: Removed push trigger for develop to avoid duplicate runs with pull_request
 on:
   push:
     branches:
-      - main
-      - develop
+      - main  # Only run on main after PR merge
     tags:
-      - 'v*'
+      - 'v*'  # Trigger release on version tags
   pull_request:
     branches:
       - main
340 changes: 340 additions & 0 deletions docs/GRACEFUL-SHUTDOWN.md
@@ -0,0 +1,340 @@
# Graceful Shutdown Guide

This document explains the graceful shutdown implementation in the `ProcessWorker` and provides testing instructions.

## Overview

The ProcessWorker implements comprehensive graceful shutdown handling to ensure:
- No message loss during shutdown
- Clean termination of in-progress operations
- Proper resource cleanup
- Coordinated shutdown with host application

## Architecture

### Shutdown Timeline

```
t=0s   SIGTERM received
       └─> CancellationToken signaled
       └─> IsShuttingDown = true
       └─> Reject new messages
       └─> Continue processing active messages

t=30s  Worker shutdown timeout reached
       └─> Log warning if messages still active
       └─> Force stop worker

t=45s  Host shutdown timeout
       └─> Process forcefully terminated
```

### Two-Timeout Strategy

#### Worker Shutdown Timeout (30s)
- **Purpose**: Internal timeout for active message completion
- **Behavior**: Allows worker to log warnings and handle stragglers gracefully
- **Configured in**: `ProcessWorker._shutdownTimeout`

#### Host Shutdown Timeout (45s)
- **Purpose**: External timeout for entire application
- **Behavior**: Includes worker shutdown + cleanup + 15s buffer
- **Configured in**: `Program.cs` β†’ `HostOptions.ShutdownTimeout`
- **Why 45s**: Prevents indefinite hangs while allowing graceful disposal
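The host-side half of this strategy can be sketched as the following `Program.cs` wiring (illustrative only — names mirror this doc, not necessarily the actual StarGate code):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

builder.Services.Configure<HostOptions>(options =>
{
    // Host-level ceiling: 45s = 30s worker shutdown timeout + 15s buffer.
    // After this, the host stops waiting and the process is terminated.
    options.ShutdownTimeout = TimeSpan.FromSeconds(45);
});

builder.Services.AddHostedService<ProcessWorker>();

await builder.Build().RunAsync();
```

The worker's own 30s timeout lives inside `ProcessWorker.StopAsync`, so the worker always gets a chance to log warnings before the host's harder 45s limit kicks in.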

### Active Message Tracking

The worker uses a `ConcurrentDictionary<string, Task>` to track messages currently being processed:

```csharp
private readonly ConcurrentDictionary<string, Task> _activeMessages;
```

- **Key**: `{ProcessId}_{UniqueGuid}` to handle multiple deliveries of the same message
- **Value**: The `Task` representing the message processing operation
- **Purpose**: Enables `Task.WhenAll()` to wait for completion during shutdown
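A minimal sketch of this tracking pattern (illustrative member names, not the exact `ProcessWorker` implementation):

```csharp
using System.Collections.Concurrent;

private readonly ConcurrentDictionary<string, Task> _activeMessages = new();
private readonly TimeSpan _shutdownTimeout = TimeSpan.FromSeconds(30);

private void Track(string processId, Task processing)
{
    // Composite key: the Guid avoids collisions when the same ProcessId
    // is delivered more than once.
    var key = $"{processId}_{Guid.NewGuid()}";
    _activeMessages[key] = processing;

    // Remove the entry once processing finishes, success or failure.
    processing.ContinueWith(_ => _activeMessages.TryRemove(key, out _),
        TaskScheduler.Default);
}

public override async Task StopAsync(CancellationToken cancellationToken)
{
    // Wait for in-flight messages, but never longer than the worker timeout.
    var pending = Task.WhenAll(_activeMessages.Values);
    await Task.WhenAny(pending, Task.Delay(_shutdownTimeout));
    await base.StopAsync(cancellationToken);
}
```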

## Message Requeue Strategy

### Cancelled Messages

Messages cancelled during shutdown are:
1. **NACK'd with requeue=true** β†’ Will be processed after restart
2. **Marked with error** β†’ `PROCESS_CANCELLED` with `retryable: true`
3. **Recorded in audit trail** β†’ Client can query process status

### Benefits
- **Zero message loss**: Every message is either completed or requeued
- **Eventual consistency**: Cancelled messages will be retried
- **Clear audit trail**: Process status reflects cancellation
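The requeue path can be sketched as below. This is a hypothetical fragment: `channel`, `deliveryTag`, and `_processService` are assumed names, and `BasicNack` is the standard RabbitMQ .NET client negative-acknowledgement call.

```csharp
using RabbitMQ.Client;

private async Task HandleCancelledMessageAsync(
    IModel channel, ulong deliveryTag, string processId)
{
    // 1. NACK with requeue=true so RabbitMQ redelivers after restart.
    channel.BasicNack(deliveryTag, multiple: false, requeue: true);

    // 2. Record the retryable cancellation for the audit trail, using a
    //    fresh token so the write survives the cancelled shutdown token.
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
    await _processService.FailProcessAsync(
        processId, "PROCESS_CANCELLED", "Cancelled during shutdown",
        canRetry: true, cts.Token);
}
```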

## Fresh CancellationToken Pattern

### Problem
During shutdown, the main `CancellationToken` is cancelled. If we need to record errors in the database, the operation would be cancelled too.

### Solution
```csharp
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
await _processService.FailProcessAsync(processId, errorCode, message, canRetry, cts.Token);
```

### Benefits
- Error recording completes even during shutdown
- Short timeout (5s) prevents indefinite hangs
- Best-effort approach for critical operations

## Health Check Integration

The `ProcessWorkerHealthCheck` reports:
- **Healthy**: Normal operation, low message count
- **Degraded**: Shutting down OR high message count (>100)

Health check data includes:
```json
{
  "status": "Healthy",
  "data": {
    "activeMessages": 5
  }
}
```
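A minimal health check matching the behavior above might look like this (a sketch — the real `ProcessWorkerHealthCheck` may differ in naming and thresholds):

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class ProcessWorkerHealthCheck : IHealthCheck
{
    private readonly ProcessWorker _worker;

    public ProcessWorkerHealthCheck(ProcessWorker worker) => _worker = worker;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var data = new Dictionary<string, object>
        {
            ["activeMessages"] = _worker.ActiveMessageCount
        };

        // Degraded while shutting down or when the backlog exceeds 100.
        if (_worker.IsShuttingDown || _worker.ActiveMessageCount > 100)
            return Task.FromResult(
                HealthCheckResult.Degraded("Shutting down or overloaded", data: data));

        return Task.FromResult(HealthCheckResult.Healthy("Worker running", data));
    }
}
```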

### Kubernetes Integration

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stargate-server
spec:
  containers:
    - name: stargate
      image: stargate:latest
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
```

**During Shutdown**:
1. Health check returns `Degraded`
2. Kubernetes stops routing new traffic
3. In-flight messages complete within timeout
4. Pod terminates cleanly

## Testing Instructions

### Unit Tests

```bash
# Run shutdown-specific tests
dotnet test --filter "FullyQualifiedName~ProcessWorkerShutdownTests"

# Run health check tests
dotnet test --filter "FullyQualifiedName~ProcessWorkerHealthCheckTests"
```

### Local Testing

#### 1. Test Normal Shutdown

```bash
# Start dependencies
docker-compose up -d rabbitmq mongodb redis

# Start server
dotnet run --project src/StarGate.Server

# In another terminal, create test processes
for i in {1..5}; do
  curl -X POST http://localhost:5000/api/processes \
    -H "Content-Type: application/json" \
    -d '{"clientId":"test","processType":"order","clientProcessId":"order-'$i'"}'
done

# Send SIGTERM (Ctrl+C in server terminal)
# Verify logs show:
# - "Shutdown requested. Active messages: X"
# - "Waiting for X active message(s) to complete"
# - "All active messages completed successfully"
# - "ProcessWorker stopped"
```

#### 2. Test Shutdown Timeout

```bash
# Create a handler that sleeps for 60 seconds
# (This simulates a long-running process)

# Start server and create process
curl -X POST http://localhost:5000/api/processes \
-H "Content-Type: application/json" \
-d '{"clientId":"test","processType":"long-running","clientProcessId":"test-1"}'

# Immediately send SIGTERM
# Verify logs show:
# - "Shutdown timeout exceeded. 1 message(s) still processing"
```

#### 3. Test Health Check

```bash
# Check health during normal operation
curl http://localhost:5000/health
# Expected: {"status":"Healthy","data":{"activeMessages":0}}

# Create multiple processes
for i in {1..10}; do
  curl -X POST http://localhost:5000/api/processes \
    -H "Content-Type: application/json" \
    -d '{"clientId":"test","processType":"order","clientProcessId":"order-'$i'"}'
done

# Check health during processing
curl http://localhost:5000/health
# Expected: {"status":"Healthy","data":{"activeMessages":10}}

# Trigger shutdown and check immediately
# Expected: {"status":"Degraded","data":{"activeMessages":X}}
```

### Docker Container Testing

```bash
# Build and start container
docker-compose up -d stargate-server

# Check logs
docker logs -f stargate-server

# Graceful stop
docker-compose stop stargate-server

# Verify graceful shutdown in logs
docker logs stargate-server | grep "Shutdown"
```

### Kubernetes Testing

```bash
# Deploy to cluster
kubectl apply -f k8s/deployment.yaml

# Watch pod during shutdown
kubectl get pod -w

# Delete pod (triggers graceful shutdown)
kubectl delete pod <pod-name>

# Check logs
kubectl logs <pod-name> | grep "Shutdown"
```

## Monitoring and Observability

### Key Metrics to Track

1. **Shutdown Duration**: Time from SIGTERM to process exit
2. **Active Messages at Shutdown**: Count when shutdown begins
3. **Timeout Exceeded Count**: How often 30s timeout is hit
4. **Message Requeue Rate**: Frequency of cancelled message requeues

### Log Queries

```bash
# Find shutdown events
grep "Shutdown requested" logs/*.log

# Find timeout events
grep "timeout exceeded" logs/*.log

# Find cancelled processes
grep "PROCESS_CANCELLED" logs/*.log
```

## Production Considerations

### Tuning Timeouts

**Factors to Consider**:
- Average message processing duration
- 95th percentile message duration
- Message complexity and dependencies
- Database operation latency

**Recommendations**:
- Worker timeout should be 2x the 95th percentile
- Host timeout should be worker timeout + 15s buffer
- Monitor and adjust based on actual metrics
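As a worked example of these recommendations (the latency value is illustrative, not a measured figure):

```csharp
// Derive timeouts from an observed 95th-percentile processing latency.
TimeSpan p95 = TimeSpan.FromSeconds(12);

TimeSpan workerTimeout = p95 * 2;                                 // 24s worker shutdown timeout
TimeSpan hostTimeout = workerTimeout + TimeSpan.FromSeconds(15);  // 39s host shutdown timeout
```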

### Alerting

**Critical Alerts**:
- Shutdown timeout exceeded (indicates slow messages)
- High requeue rate (indicates frequent restarts)
- Health check degraded for >5 minutes

**Warning Alerts**:
- Active message count >100 (high load)
- Shutdown duration >20s (approaching timeout)

## Troubleshooting

### Issue: Shutdown takes too long

**Symptoms**: Logs show timeout warnings

**Diagnosis**:
1. Check message processing duration in logs
2. Identify slow handlers
3. Look for database/network latency

**Solutions**:
- Increase worker timeout
- Optimize slow handlers
- Add timeout to handler operations

### Issue: Messages lost during shutdown

**Symptoms**: Processes in "Processing" state after restart

**Diagnosis**:
1. Check if NACK is being called
2. Verify RabbitMQ requeue behavior
3. Check for exceptions in shutdown logic

**Solutions**:
- Ensure NACK with requeue=true
- Verify message consumer configuration
- Add exception handling in shutdown path

### Issue: Health check always degraded

**Symptoms**: Kubernetes constantly restarting pods

**Diagnosis**:
1. Check active message count
2. Verify if worker is stuck
3. Look for deadlocks or infinite loops

**Solutions**:
- Investigate high message count cause
- Add handler timeouts
- Review handler implementation

## References

- [.NET Generic Host Shutdown](https://learn.microsoft.com/en-us/dotnet/core/extensions/generic-host)
- [Graceful Shutdown Best Practices](https://andrewlock.net/extending-the-shutdown-timeout-setting-to-ensure-graceful-ihostedservice-shutdown/)
- [Health Checks in .NET](https://learn.microsoft.com/en-us/aspnet/core/host-and-deploy/health-checks)
- [Kubernetes Pod Lifecycle](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/)