fix(gateway): implement rollback mechanism for allocation failures #26

@FL-AntoineDurand

Description

Implement a rollback mechanism for gateway allocation failures. When /gateway/start fails after partial allocation (e.g., DNS registration, the Nginx reload, or the handshake fails), the system should automatically roll back all completed steps to prevent gateway pool exhaustion and inconsistent state.

Context

The gateway allocation process involves multiple steps:

  1. Allocate gateway from pool (database)
  2. Register DNS (org-{uuid}.domain.local → 127.0.0.1)
  3. Create Nginx config (routes org traffic to gateway HTTP port)
  4. Reload Nginx
  5. Call gateway handshake

If any step fails after step 1, the gateway remains allocated in the database but is unusable, leading to:

  • Pool exhaustion: Gateway marked as allocated but not serving
  • Inconsistent state: DNS/Nginx config may exist but gateway not initialized
  • Resource waste: Gateway stuck in unusable state
  • Manual intervention required: Admin must manually clean up

Current State

Location: packages/app-ganymede/src/routes/gateway/index.ts

Current Code:

} catch (error: any) {
  log(
    EPriority.Error,
    'GATEWAY_ALLOC',
    `Failed to start gateway for org ${organization_id}:`,
    error.message
  );

  // TODO: Cleanup on failure (deallocate, remove DNS, remove nginx config)

  if (error.message.includes('no_gateway_available')) {
    return res.status(503).json({ error: 'No available gateways in pool. Try again later.' });
  }

  return res.status(500).json({
    error: 'Failed to start gateway',
    details: error.message,
  });
}

Problem: There is no rollback logic. If any step after step 1 fails, the gateway remains allocated in the database.

Requirements

1. Track Allocation Steps

Track which steps have been completed:

  • Step tracking: Use flags or array to track completed steps
  • Before each step: Mark step as "in progress"
  • After each step: Mark step as "completed"
  • On failure: Know which steps need rollback
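The tracking described above could be sketched as a small class. This is only an illustration: the `StepTracker` name, its methods, and the step names are hypothetical and not part of the existing codebase. It treats only completed steps as rollback candidates, which matches the test matrix later in this issue.

```typescript
// Hypothetical step names; the issue only asks for "flags or array".
type Step = 'allocate' | 'dns' | 'nginx-config' | 'nginx-reload' | 'handshake';
type StepStatus = 'in_progress' | 'completed';

class StepTracker {
  // A Map keeps insertion order, so steps are listed in the order they started.
  private readonly steps = new Map<Step, StepStatus>();

  // Call before running a step.
  begin(step: Step): void {
    this.steps.set(step, 'in_progress');
  }

  // Call after a step has succeeded.
  complete(step: Step): void {
    if (this.steps.get(step) !== 'in_progress') {
      throw new Error(`Step "${step}" was completed without being started`);
    }
    this.steps.set(step, 'completed');
  }

  // Completed steps, most recent first: the order in which to undo them.
  completedInReverse(): Step[] {
    return Array.from(this.steps.entries())
      .filter(([, status]) => status === 'completed')
      .map(([step]) => step)
      .reverse();
  }

  // The step that was running when the failure happened, if any.
  inProgressStep(): Step | undefined {
    for (const [step, status] of Array.from(this.steps.entries())) {
      if (status === 'in_progress') return step;
    }
    return undefined;
  }
}

// Usage: allocation succeeded, DNS registration failed midway.
const tracker = new StepTracker();
tracker.begin('allocate');
tracker.complete('allocate');
tracker.begin('dns');
console.log(tracker.completedInReverse()); // [ 'allocate' ]
console.log(tracker.inProgressStep());     // dns
```

Keeping the in-progress step separate makes it available for logging without rolling it back by default.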

2. Rollback Logic

Implement rollback for each step:

  • Step 1 (Allocation): Call proc_organizations_gateways_stop(gateway_id) to deallocate
  • Step 2 (DNS): Call powerDNS.deregisterGateway(organization_id) to remove DNS
  • Step 3 (Nginx Config): Call nginxManager.removeGatewayConfig(organization_id) to remove config
  • Step 4 (Nginx Reload): No dedicated rollback (the step 3 rollback removes the config; Nginx may need another reload before it stops serving the route)
  • Step 5 (Handshake): No rollback needed (gateway not initialized)
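As a sketch of the mapping above, the completed steps can be translated into an ordered list of undo actions. The `RollbackDeps` interface is an assumption made for illustration: its three methods stand in for proc_organizations_gateways_stop, powerDNS.deregisterGateway, and nginxManager.removeGatewayConfig, whose real signatures are not shown in this issue.

```typescript
type Step = 'allocate' | 'dns' | 'nginx-config' | 'nginx-reload' | 'handshake';

// Hypothetical wrappers around the real services; signatures are assumed.
interface RollbackDeps {
  stopGateway(gatewayId: string): Promise<void>;            // proc_organizations_gateways_stop
  deregisterDns(organizationId: string): Promise<void>;     // powerDNS.deregisterGateway
  removeNginxConfig(organizationId: string): Promise<void>; // nginxManager.removeGatewayConfig
}

interface RollbackAction {
  step: Step;
  undo: () => Promise<void>;
}

// Builds the undo actions for the completed steps, most recent step first.
function buildRollbackActions(
  completedSteps: readonly Step[],
  organizationId: string,
  gatewayId: string,
  deps: RollbackDeps,
): RollbackAction[] {
  const undoByStep: Partial<Record<Step, () => Promise<void>>> = {
    'allocate': () => deps.stopGateway(gatewayId),
    'dns': () => deps.deregisterDns(organizationId),
    'nginx-config': () => deps.removeNginxConfig(organizationId),
    // 'nginx-reload' and 'handshake' have no undo action of their own.
  };

  const actions: RollbackAction[] = [];
  for (const step of completedSteps.slice().reverse()) {
    const undo = undoByStep[step];
    if (undo) actions.push({ step, undo });
  }
  return actions;
}
```

Injecting the dependencies keeps the function free of real database, DNS, or Nginx calls, so it can be unit tested with stubs.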

3. Error Handling

Handle rollback errors gracefully:

  • Rollback failures: Log but don't throw (avoid masking original error)
  • Partial rollback: Continue rolling back even if one step fails
  • Error reporting: Return original error to user, log rollback status
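One way to meet these three points is a best-effort runner that catches each undo failure, logs it, and keeps going, then returns a per-step status for logging. This is a minimal sketch: `runRollback`, `RollbackAction`, and `RollbackResult` are hypothetical names, and the logger is a plain callback rather than the project's `log` function.

```typescript
interface RollbackAction {
  step: string;
  undo: () => Promise<void>;
}

interface RollbackResult {
  step: string;
  ok: boolean;
  error?: string;
}

// Runs every undo action in order. A failing action is logged and recorded,
// but never rethrown, so the caller can still report the original error.
async function runRollback(
  actions: readonly RollbackAction[],
  logError: (message: string) => void = console.error,
): Promise<RollbackResult[]> {
  const results: RollbackResult[] = [];
  for (const action of actions) {
    try {
      await action.undo();
      results.push({ step: action.step, ok: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      logError(`Rollback of "${action.step}" failed: ${message}`);
      results.push({ step: action.step, ok: false, error: message });
    }
  }
  return results;
}
```

The endpoint's catch block would await `runRollback(...)`, log the returned results, and then respond with the original error as it does today.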

4. Implementation Pattern

Use try-catch with step tracking:

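const completedSteps: string[] = [];
// Declared outside the try block so the catch block can reference it.
// The string type is an assumption; use whatever type step 1 returns.
let gateway_id: string | undefined;

try {
  // Step 1: Allocate (assign gateway_id), then record the step once it succeeds
  completedSteps.push('allocate');
  
  // Step 2: DNS
  completedSteps.push('dns');
  
  // Step 3: Nginx config
  completedSteps.push('nginx-config');
  
  // Step 4: Reload
  completedSteps.push('nginx-reload');
  
  // Step 5: Handshake
  completedSteps.push('handshake');
} catch (error) {
  // Roll back completed steps in reverse order, then rethrow the original error
  await rollbackAllocation(organization_id, gateway_id, completedSteps);
  throw error;
}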

Implementation Plan

Phase 1: Step Tracking

  1. Add step tracking mechanism
  2. Track steps as they complete
  3. Test step tracking

Phase 2: Rollback Function

  1. Create rollbackAllocation() function
  2. Implement rollback for each step
  3. Handle rollback errors gracefully
  4. Test rollback logic

Phase 3: Integration

  1. Integrate rollback into /gateway/start endpoint
  2. Test with simulated failures at each step
  3. Verify rollback works correctly
  4. Test error handling

Phase 4: Testing

  1. Test failure at step 1 (no rollback needed)
  2. Test failure at step 2 (rollback step 1)
  3. Test failure at step 3 (rollback steps 1-2)
  4. Test failure at step 4 (rollback steps 1-3)
  5. Test failure at step 5 (rollback steps 1-4)
  6. Test rollback failure handling
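The expected outcome of tests 1 to 5 can be written down as a small simulation before the real integration tests exist. This is a sketch under assumptions: the step names and `simulateFailureAt` are hypothetical, and no real service is called. Because steps 4 and 5 have no undo action, a failure at step 5 rolls back the same three steps as a failure at step 4.

```typescript
const STEPS = ['allocate', 'dns', 'nginx-config', 'nginx-reload', 'handshake'] as const;
type Step = (typeof STEPS)[number];

// Steps that have an undo action, per the rollback requirements.
const HAS_ROLLBACK: ReadonlySet<Step> = new Set<Step>(['allocate', 'dns', 'nginx-config']);

// Simulates the allocation flow and throws at the 1-based step `failAt`.
// Returns the steps that would be rolled back, in rollback order.
function simulateFailureAt(failAt: number): Step[] {
  const completed: Step[] = [];
  try {
    STEPS.forEach((step, index) => {
      if (index + 1 === failAt) throw new Error(`simulated failure at ${step}`);
      completed.push(step);
    });
    return []; // no failure was injected, so nothing is rolled back
  } catch {
    return completed.slice().reverse().filter((step) => HAS_ROLLBACK.has(step));
  }
}

for (let n = 1; n <= STEPS.length; n++) {
  console.log(`fail at step ${n}: rolled back [${simulateFailureAt(n).join(', ')}]`);
}
// fail at step 1: rolled back []
// fail at step 2: rolled back [allocate]
// fail at step 3: rolled back [dns, allocate]
// fail at step 4: rolled back [nginx-config, dns, allocate]
// fail at step 5: rolled back [nginx-config, dns, allocate]
```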

Related Files

  • packages/app-ganymede/src/routes/gateway/index.ts - Gateway allocation endpoint
  • packages/app-ganymede/src/services/nginx-manager.ts - Nginx manager
  • packages/app-ganymede/src/services/powerdns.ts - PowerDNS service
  • packages/app-ganymede/database/procedures/proc_organizations_gateways_stop.sql - Deallocation procedure

Acceptance Criteria

  • Step tracking implemented
  • Rollback function created
  • Rollback deallocates gateway (database)
  • Rollback removes DNS records
  • Rollback removes Nginx config
  • Rollback handles errors gracefully
  • Rollback integrated into /gateway/start
  • Tested with failures at each step
  • Gateway pool not exhausted on failures
  • No inconsistent state after failures
  • Original error returned to user
  • Rollback status logged

Questions to Resolve

  1. Should we roll back the Nginx reload if it fails? (The config has already been created.)
  2. Should we attempt a rollback even if the original error was critical?
  3. Should we have a timeout for rollback operations?
  4. Should we notify admins of rollback events?
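If question 3 is answered with yes, one possible shape is a generic wrapper that rejects when an undo action takes too long. This is a sketch only: `withTimeout` is a hypothetical helper, and the timeout value would still need to be decided. Note that the wrapper stops waiting but does not cancel the underlying operation.

```typescript
// Rejects with a timeout error if `operation` has not settled within `ms`.
async function withTimeout<T>(operation: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([operation, timeout]);
  } finally {
    // Always clear the timer so it does not keep the process alive.
    clearTimeout(timer);
  }
}
```

A rollback step could then be wrapped as `withTimeout(undo(), 5000, 'dns rollback')`, where the 5000 ms value is purely illustrative.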

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)