fix(nginx): add error recovery for nginx reload failures #27

@FL-AntoineDurand

Description

Add error recovery for Nginx reload failures. When nginx -t (configuration test) fails after creating a new gateway config, the system should automatically clean up the bad config file and restore the previous known-good state instead of leaving Nginx in a broken state.

Context

When creating a new gateway Nginx configuration:

  1. Nginx config file is created
  2. nginx -t is run to test configuration
  3. If test passes, nginx -s reload is run
  4. If test fails, error is thrown but bad config file remains

Problem: If nginx -t fails, the bad config file is left in place, which can cause:

  • Every later nginx -t and reload to fail, including for unrelated valid configs
  • Manual intervention to find and remove the file
  • Nginx failing to start after a restart, because the invalid file is still loaded
  • No automatic recovery

Current State

Location: packages/app-ganymede/src/services/nginx-manager.ts

Current Implementation:

  • createGatewayConfig() creates config file
  • reloadNginx() runs nginx -t then nginx -s reload
  • If nginx -t fails, error is thrown but config file remains

Problem: No cleanup of bad config files on test failure.

Requirements

1. Config Test Before Reload

Test configuration before reloading:

  • Current: reloadNginx() already tests with nginx -t
  • Enhancement: Ensure test always runs before reload
  • Error handling: Catch test failures and handle cleanup

2. Cleanup on Test Failure

Remove bad config file if test fails:

  • Detect failure: Catch nginx -t failure
  • Delete config: Remove the newly created config file
  • Log cleanup: Log that config was removed due to test failure
  • Return error: Return descriptive error to caller
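The cleanup steps above could be sketched roughly as follows. This is a minimal sketch, not the existing nginx-manager.ts code: the names writeConfigWithCleanup, ConfigTester and nginxTest are hypothetical, and the tester is injected so the logic can be exercised without a running Nginx.

```typescript
import { promises as fs } from "node:fs";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Hypothetical type: resolves if the config test passes, rejects otherwise.
export type ConfigTester = () => Promise<void>;

// Default tester: runs `nginx -t`. The promisified execFile rejects on a
// non-zero exit code, which is how a failed test surfaces here.
export const nginxTest: ConfigTester = async () => {
  await execFileAsync("nginx", ["-t"]);
};

// Write the config, test it, and delete it again if the test fails.
export async function writeConfigWithCleanup(
  configPath: string,
  contents: string,
  testConfig: ConfigTester = nginxTest,
): Promise<void> {
  await fs.writeFile(configPath, contents, "utf8");
  try {
    await testConfig();
  } catch (err) {
    // force: true avoids a second error if the file is already gone.
    await fs.rm(configPath, { force: true });
    console.warn(`Removed invalid config file: ${configPath}`);
    const details = err instanceof Error ? err.message : String(err);
    throw new Error(`Nginx configuration test failed: ${details}`);
  }
}
```

The caller would run nginx -s reload only after this function resolves, so a failed test never reaches the reload step.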

3. Restore Previous State

Optionally restore previous known-good state:

  • Backup: Keep backup of previous config (optional)
  • Restore: Restore previous config if new one fails (optional)
  • Or: Simply remove bad config and let Nginx use existing configs

Recommended: Simply remove the bad config. This is simpler, and Nginx keeps serving the existing configs.
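If the optional backup path is chosen instead, it might look like the sketch below. The function name replaceConfigWithBackup and the ".bak" suffix are assumptions for illustration. The suffix is only safe if the Nginx include pattern does not match it (for example an include of "*.conf"), which this issue does not confirm.

```typescript
import { promises as fs } from "node:fs";

// Replace a config file, keeping a backup that is restored if the test fails.
// `testConfig` is a hypothetical injected function that rejects on failure.
export async function replaceConfigWithBackup(
  configPath: string,
  contents: string,
  testConfig: () => Promise<void>,
): Promise<void> {
  const backupPath = `${configPath}.bak`; // assumed naming, see note above
  let hadPrevious = true;
  try {
    await fs.copyFile(configPath, backupPath);
  } catch (err) {
    // ENOENT means there was no previous config; anything else is a real error.
    if ((err as { code?: string }).code !== "ENOENT") throw err;
    hadPrevious = false;
  }

  await fs.writeFile(configPath, contents, "utf8");
  try {
    await testConfig();
  } catch (err) {
    if (hadPrevious) {
      // Put the known-good file back in place.
      await fs.rename(backupPath, configPath);
    } else {
      // Nothing to restore, so just remove the bad file.
      await fs.rm(configPath, { force: true });
    }
    throw err;
  }
  if (hadPrevious) await fs.rm(backupPath, { force: true });
}
```

When a gateway config is always new (no previous file for that organization), this reduces to the simpler remove-only behaviour.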

4. Error Messages

Provide descriptive error messages:

  • Test failure: "Nginx configuration test failed: [error details]"
  • Cleanup status: "Removed invalid config file: [path]"
  • Recovery status: "Nginx restored to previous state"
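One way to keep these messages consistent is to build them in a single helper. The sketch below is illustrative only: describeTestFailure and FailureReport are hypothetical names, and the "Could not remove" wording for the failed-cleanup case is an assumption, since the issue only specifies the three messages above.

```typescript
// Hypothetical shape for what the caller receives after a failed config test.
export interface FailureReport {
  error: string;
  status: string[];
}

// Build the error and status messages from the failure details.
export function describeTestFailure(
  configPath: string,
  details: string,
  cleanedUp: boolean,
): FailureReport {
  const error = `Nginx configuration test failed: ${details.trim()}`;
  const status = cleanedUp
    ? [
        `Removed invalid config file: ${configPath}`,
        "Nginx restored to previous state",
      ]
    : // Assumed wording for the case the issue does not cover.
      [`Could not remove invalid config file: ${configPath}`];
  return { error, status };
}
```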

Implementation Plan

Phase 1: Enhance reloadNginx()

  1. Ensure nginx -t always runs before reload
  2. Catch test failures
  3. Identify which config file caused failure
  4. Delete bad config file
  5. Log cleanup operation

Phase 2: Config Tracking

  1. Track which config file was just created
  2. Map config file to organization
  3. Enable targeted cleanup
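A small in-memory tracker may be enough for this phase. The sketch below assumes the caller already knows the config path and organization ID. ConfigTracker and its method names are hypothetical and do not reflect the current nginx-manager.ts API.

```typescript
// Tracks config files that were written but not yet confirmed by a
// successful test and reload, so cleanup can target exactly those files.
export class ConfigTracker {
  // configPath -> organizationId
  private readonly pendingConfigs = new Map<string, string>();

  // Record a config right after it is written to disk.
  track(configPath: string, organizationId: string): void {
    this.pendingConfigs.set(configPath, organizationId);
  }

  // Call after a successful reload: the config is now known-good.
  confirm(configPath: string): void {
    this.pendingConfigs.delete(configPath);
  }

  // Configs that would be removed if the next test fails.
  pending(): Array<{ configPath: string; organizationId: string }> {
    return Array.from(this.pendingConfigs, ([configPath, organizationId]) => ({
      configPath,
      organizationId,
    }));
  }
}
```

Because only unconfirmed files are listed, a cleanup pass never touches configs that have already passed a test.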

Phase 3: Error Handling

  1. Improve error messages
  2. Return descriptive errors
  3. Log cleanup operations
  4. Test error scenarios

Phase 4: Testing

  1. Test with valid config (should work)
  2. Test with invalid config (should cleanup)
  3. Test with multiple configs (should only remove bad one)
  4. Test Nginx state after cleanup

Related Files

  • packages/app-ganymede/src/services/nginx-manager.ts - Nginx manager
  • packages/app-ganymede/src/routes/gateway/index.ts - Gateway allocation (uses nginx manager)

Acceptance Criteria

  • nginx -t runs before reload
  • Test failures are caught
  • Bad config files are deleted on test failure
  • Cleanup is logged
  • Descriptive error messages returned
  • Nginx state restored (bad config removed)
  • Works with multiple configs
  • No manual intervention required

Questions to Resolve

  1. Should we backup previous config before creating new one?
  2. Should we restore previous config or just remove bad one?
  3. How do we identify which config file caused the failure?
  4. Should we validate config syntax before writing file?
  5. Should we support rollback of multiple config changes?
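For question 3, one possible approach is to parse the nginx -t error output, which typically names the offending file and line in the form "[emerg] ... in /path/to/file:LINE". The sketch below assumes that format. parseFailingConfig is a hypothetical helper, and it returns null when the output does not match, in which case the caller would fall back to removing the file it just created.

```typescript
// Hypothetical result type: the file and line reported by nginx -t.
export interface FailingConfig {
  path: string;
  line: number;
}

// Extract the file path and line number from an "[emerg]" message.
// Assumes the "... in <path>:<line>" format and a path without spaces.
export function parseFailingConfig(stderr: string): FailingConfig | null {
  const match = /\[emerg\].* in (\S+):(\d+)/.exec(stderr);
  if (!match) return null;
  return { path: match[1], line: Number(match[2]) };
}
```

Comparing the parsed path with the file that was just written also guards against deleting a new config when an older file is the one that is actually broken.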

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request