Description
Implement a rollback mechanism for gateway allocation failures. When /gateway/start fails after partial allocation (e.g., Nginx reload fails, DNS registration fails, or handshake fails), the system should automatically roll back all completed steps to prevent gateway pool exhaustion and inconsistent state.
Context
The gateway allocation process involves multiple steps:
- Allocate gateway from pool (database)
- Register DNS (org-{uuid}.domain.local → 127.0.0.1)
- Create Nginx config (routes org traffic to gateway HTTP port)
- Reload Nginx
- Call gateway handshake
If any step fails after step 1, the gateway remains allocated in the database but is unusable, leading to:
- Pool exhaustion: Gateway marked as allocated but not serving
- Inconsistent state: DNS/Nginx config may exist but gateway not initialized
- Resource waste: Gateway stuck in unusable state
- Manual intervention required: Admin must manually clean up
Current State
Location: packages/app-ganymede/src/routes/gateway/index.ts
Current Code:
} catch (error: any) {
  log(
    EPriority.Error,
    'GATEWAY_ALLOC',
    `Failed to start gateway for org ${organization_id}:`,
    error.message
  );
  // TODO: Cleanup on failure (deallocate, remove DNS, remove nginx config)
  if (error.message.includes('no_gateway_available')) {
    return res.status(503).json({ error: 'No available gateways in pool. Try again later.' });
  }
  return res.status(500).json({
    error: 'Failed to start gateway',
    details: error.message,
  });
}
Problem: No rollback logic. If allocation fails after step 1, the gateway remains allocated.
Requirements
1. Track Allocation Steps
Track which steps have been completed:
- Step tracking: Use flags or array to track completed steps
- Before each step: Mark step as "in progress"
- After each step: Mark step as "completed"
- On failure: Know which steps need rollback
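One possible shape for this tracking is sketched below. The `StepTracker` class, its method names, and the step identifiers are hypothetical and do not exist in the codebase; the sketch only illustrates marking a step "in progress" before it runs and "completed" after it succeeds.

```typescript
// Hypothetical step names for the five allocation steps listed under Context.
type AllocationStep = 'allocate' | 'dns' | 'nginx-config' | 'nginx-reload' | 'handshake';
type StepStatus = 'in-progress' | 'completed';

// Hypothetical helper: records each step's status in the order the steps were started.
class StepTracker {
  private readonly records: { step: AllocationStep; status: StepStatus }[] = [];

  // Mark the step as started, run it, and mark it completed only if it did not throw.
  async run<T>(step: AllocationStep, action: () => Promise<T>): Promise<T> {
    const record = { step, status: 'in-progress' as StepStatus };
    this.records.push(record);
    const result = await action(); // if this throws, the record stays "in-progress"
    record.status = 'completed';
    return result;
  }

  // Steps that finished successfully, newest first (the order a rollback needs).
  completedInReverse(): AllocationStep[] {
    return this.records
      .filter((r) => r.status === 'completed')
      .map((r) => r.step)
      .reverse();
  }

  // The step that was running when the failure happened, if any.
  failedStep(): AllocationStep | undefined {
    const pending = this.records.filter((r) => r.status === 'in-progress');
    return pending.length > 0 ? pending[0].step : undefined;
  }
}
```

Compared with a plain array of completed steps, this variant also records which step failed, which may be useful when logging the rollback status.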
2. Rollback Logic
Implement rollback for each step:
- Step 1 (Allocation): Call proc_organizations_gateways_stop(gateway_id) to deallocate
- Step 2 (DNS): Call powerDNS.deregisterGateway(organization_id) to remove DNS
- Step 3 (Nginx Config): Call nginxManager.removeGatewayConfig(organization_id) to remove config
- Step 4 (Nginx Reload): No rollback needed (config already removed)
- Step 5 (Handshake): No rollback needed (gateway not initialized)
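The mapping from completed steps to undo actions could look roughly like the sketch below. The `RollbackDeps` interface and the `planRollback` function are hypothetical: the interface only mirrors the three calls named above, and the real signatures (including the type of `gateway_id`, assumed here to be a string) may differ.

```typescript
// Hypothetical dependency shape; each member wraps one of the calls named in the list above.
interface RollbackDeps {
  stopGateway(gatewayId: string): Promise<void>;              // proc_organizations_gateways_stop
  deregisterGateway(organizationId: string): Promise<void>;   // powerDNS.deregisterGateway
  removeGatewayConfig(organizationId: string): Promise<void>; // nginxManager.removeGatewayConfig
}

type UndoAction = { step: string; undo: () => Promise<void> };

// Returns the undo actions for the completed steps, newest step first.
function planRollback(
  deps: RollbackDeps,
  organizationId: string,
  gatewayId: string,
  completedSteps: string[],
): UndoAction[] {
  const undoByStep: Record<string, (() => Promise<void>) | undefined> = {
    allocate: () => deps.stopGateway(gatewayId),
    dns: () => deps.deregisterGateway(organizationId),
    'nginx-config': () => deps.removeGatewayConfig(organizationId),
    // 'nginx-reload' and 'handshake' have no undo action of their own.
  };

  const plan: UndoAction[] = [];
  for (const step of completedSteps.slice().reverse()) {
    const undo = undoByStep[step];
    if (undo) plan.push({ step, undo });
  }
  return plan;
}
```

Building the plan separately from executing it keeps the step-to-action mapping easy to unit test with fake dependencies.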
3. Error Handling
Handle rollback errors gracefully:
- Rollback failures: Log but don't throw (avoid masking original error)
- Partial rollback: Continue rolling back even if one step fails
- Error reporting: Return original error to user, log rollback status
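A minimal sketch of these rules follows, assuming the undo actions are supplied as a list of `{ step, undo }` pairs. The `runRollback` function, the `RollbackOutcome` type, and the `logError` hook are hypothetical names used for illustration only.

```typescript
// Outcome of one undo action, collected so the caller can log the rollback status.
interface RollbackOutcome {
  step: string;
  ok: boolean;
  error?: string;
}

// Runs every undo action in the given order. A failing action is logged and recorded
// but never rethrown, so later actions still run and the original error is not masked.
async function runRollback(
  plan: { step: string; undo: () => Promise<void> }[],
  logError: (message: string) => void, // hypothetical logger hook
): Promise<RollbackOutcome[]> {
  const outcomes: RollbackOutcome[] = [];
  for (const { step, undo } of plan) {
    try {
      await undo();
      outcomes.push({ step, ok: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      logError(`Rollback of step "${step}" failed: ${message}`);
      outcomes.push({ step, ok: false, error: message });
    }
  }
  return outcomes;
}
```

The endpoint's catch block could await this function, log the returned outcomes, and then respond with the original error unchanged.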
4. Implementation Pattern
Use try-catch with step tracking:
const completedSteps: string[] = [];
try {
  // Step 1: Allocate
  completedSteps.push('allocate');
  // Step 2: DNS
  completedSteps.push('dns');
  // Step 3: Nginx config
  completedSteps.push('nginx-config');
  // Step 4: Reload
  completedSteps.push('nginx-reload');
  // Step 5: Handshake
  completedSteps.push('handshake');
} catch (error) {
  // Rollback in reverse order
  await rollbackAllocation(organization_id, gateway_id, completedSteps);
  throw error;
}
Implementation Plan
Phase 1: Step Tracking
- Add step tracking mechanism
- Track steps as they complete
- Test step tracking
Phase 2: Rollback Function
- Create rollbackAllocation() function
- Implement rollback for each step
- Handle rollback errors gracefully
- Test rollback logic
Phase 3: Integration
- Integrate rollback into /gateway/start endpoint
- Test with simulated failures at each step
- Verify rollback works correctly
- Test error handling
Phase 4: Testing
- Test failure at step 1 (no rollback needed)
- Test failure at step 2 (rollback step 1)
- Test failure at step 3 (rollback steps 1-2)
- Test failure at step 4 (rollback steps 1-3)
- Test failure at step 5 (rollback steps 1-3; step 4 has no rollback action)
- Test rollback failure handling
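The failure matrix above could be driven by a small harness along these lines. Everything here is a hypothetical sketch: the fake steps do no real work, and `buildSteps` and `completedBeforeFailure` are illustrative names, not existing test utilities.

```typescript
const STEP_NAMES = ['allocate', 'dns', 'nginx-config', 'nginx-reload', 'handshake'];

type FakeStep = { name: string; run: () => Promise<void> };

// Builds five fake steps; the step at position `failAt` (1-based) throws.
// Pass 0 to build a sequence in which no step fails.
function buildSteps(failAt: number): FakeStep[] {
  return STEP_NAMES.map((name, index) => ({
    name,
    run: async () => {
      if (index + 1 === failAt) {
        throw new Error(`simulated failure at step ${failAt}`);
      }
    },
  }));
}

// Runs the steps in order and returns the names of those that completed
// before the first failure, i.e. the steps a rollback has to consider.
async function completedBeforeFailure(steps: FakeStep[]): Promise<string[]> {
  const completed: string[] = [];
  try {
    for (const step of steps) {
      await step.run();
      completed.push(step.name);
    }
  } catch {
    // Swallowed only because this sketch inspects `completed`;
    // the pattern in section 4 rethrows the original error.
  }
  return completed;
}
```

A test for each failure point would then assert that `completedBeforeFailure(buildSteps(n))` contains exactly the first `n - 1` step names and that the rollback undoes those steps and no others.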
Related Files
- packages/app-ganymede/src/routes/gateway/index.ts - Gateway allocation endpoint
- packages/app-ganymede/src/services/nginx-manager.ts - Nginx manager
- packages/app-ganymede/src/services/powerdns.ts - PowerDNS service
- packages/app-ganymede/database/procedures/proc_organizations_gateways_stop.sql - Deallocation procedure
Acceptance Criteria
- Step tracking implemented
- Rollback function created
- Rollback deallocates gateway (database)
- Rollback removes DNS records
- Rollback removes Nginx config
- Rollback handles errors gracefully
- Rollback integrated into /gateway/start
- Tested with failures at each step
- Gateway pool not exhausted on failures
- No inconsistent state after failures
- Original error returned to user
- Rollback status logged
Questions to Resolve
- Should we roll back the Nginx reload if it fails? (Config already created)
- Should we attempt a rollback even if the original error was critical?
- Should we have a timeout for rollback operations?
- Should we notify admins of rollback events?
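If the timeout question is answered with yes, one option is to wrap each rollback call in a helper like the one sketched here. The `withTimeout` name, its parameters, and the error message format are assumptions for illustration; nothing like it is known to exist in the codebase.

```typescript
// Hypothetical helper: rejects if `operation` does not settle within `ms` milliseconds.
// The underlying operation is not cancelled; only the wait for it is abandoned.
function withTimeout<T>(operation: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    operation.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A timed-out rollback step would then surface as an ordinary rejection, so it can be logged like any other rollback failure without blocking the response to the client.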