feat(monitoring): add monitoring hooks for gateway lifecycle events #28

@FL-AntoineDurand

Description

Add monitoring hooks for gateway lifecycle events. Implement centralized logging, metrics tracking, and alerting for gateway allocation, deallocation, utilization, and pool status to enable observability and proactive issue detection.

Context

Currently, gateway lifecycle events are logged but not systematically monitored:

  • Allocation/deallocation events logged but not tracked
  • No metrics for gateway utilization
  • No alerts for pool exhaustion
  • No centralized monitoring dashboard
  • Difficult to diagnose issues or predict capacity needs

Note: This task is marked as "Deferred" in the TODO, pending a comprehensive monitoring implementation. It is tracked here as a future enhancement.

Requirements

1. Event Logging

Log gateway lifecycle events to centralized logger:

  • Allocation events: Log when gateway is allocated (org, gateway_id, timestamp)
  • Deallocation events: Log when gateway is deallocated (org, gateway_id, timestamp, duration)
  • Allocation failures: Log allocation failures with error details
  • Pool status: Log pool status changes (available count, total count)
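A minimal sketch of how these four event kinds could be emitted as structured JSON lines. The type names, the `component` field, the `duration_ms` unit, and the `LogSink` abstraction are all hypothetical. The actual centralized logger interface is described in doc/architecture/LOGGING_AND_OBSERVABILITY.md and may differ.

```typescript
// Hypothetical event shapes; the fields mirror the requirement list above.
type GatewayEvent =
  | { type: "allocation"; org: string; gateway_id: string; timestamp: string }
  | { type: "deallocation"; org: string; gateway_id: string; timestamp: string; duration_ms: number }
  | { type: "allocation_failure"; org: string; timestamp: string; error: string }
  | { type: "pool_status"; available: number; total: number; timestamp: string };

// Assumption: the centralized logger can be reached through a function that
// accepts one serialized line. console.log stands in for it here.
type LogSink = (line: string) => void;

// Serializes one event as a single JSON line and hands it to the sink.
function logGatewayEvent(event: GatewayEvent, sink: LogSink = console.log): string {
  const line = JSON.stringify({ component: "gateway", ...event });
  sink(line);
  return line;
}

// Builds a deallocation event, deriving the duration from the allocation time.
function deallocationEvent(
  org: string,
  gatewayId: string,
  allocatedAt: Date,
  now: Date = new Date()
): GatewayEvent {
  return {
    type: "deallocation",
    org,
    gateway_id: gatewayId,
    timestamp: now.toISOString(),
    duration_ms: now.getTime() - allocatedAt.getTime(),
  };
}
```

Keeping the sink as a plain function would let the allocation route stay unaware of which logging backend is eventually chosen.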

2. Metrics Tracking

Track gateway utilization metrics:

  • Pool utilization: Allocated gateways / total gateways (equivalently, 1 - available / total)
  • Allocation rate: Allocations per hour/day
  • Average allocation duration: How long gateways are allocated
  • Failure rate: Allocation failures / total attempts
  • Pool exhaustion events: Times when pool was exhausted
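The metrics above can be derived from a handful of raw counters. The sketch below is one possible shape, not the project's implementation: `PoolCounters`, `PoolMetrics`, and `computeMetrics` are hypothetical names, and utilization is taken here as the allocated share of the pool, with the available share reported alongside it.

```typescript
// Hypothetical raw counters collected over one observation window.
interface PoolCounters {
  total: number;                  // gateways in the pool
  available: number;              // gateways currently free
  attempts: number;               // allocation attempts in the window
  failures: number;               // failed attempts in the window
  exhaustionEvents: number;       // times the pool had no free gateway
  completedDurationsMs: number[]; // durations of allocations that ended
}

interface PoolMetrics {
  availableRatio: number;         // available / total
  utilization: number;            // (total - available) / total
  allocationsPerHour: number;     // successful allocations per hour
  failureRate: number;            // failures / attempts
  avgAllocationDurationMs: number;
  exhaustionEvents: number;
}

function computeMetrics(c: PoolCounters, windowMs: number): PoolMetrics {
  // Returns 0 instead of NaN/Infinity when the denominator is empty.
  const ratio = (n: number, d: number): number => (d > 0 ? n / d : 0);
  const successes = c.attempts - c.failures;
  const totalDurationMs = c.completedDurationsMs.reduce((sum, d) => sum + d, 0);
  return {
    availableRatio: ratio(c.available, c.total),
    utilization: ratio(c.total - c.available, c.total),
    allocationsPerHour: ratio(successes, windowMs / 3600000),
    failureRate: ratio(c.failures, c.attempts),
    avgAllocationDurationMs: ratio(totalDurationMs, c.completedDurationsMs.length),
    exhaustionEvents: c.exhaustionEvents,
  };
}
```

Computing derived values from counters, rather than storing them, keeps the collection side simple and leaves aggregation windows open until a monitoring system is chosen.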

3. Alerting

Set up alerts for critical conditions:

  • Pool exhaustion: Alert when available gateways < threshold
  • High failure rate: Alert when allocation failure rate > threshold
  • Long allocation duration: Alert when an allocation stays open longer than a configured maximum duration
  • Pool imbalance: Alert when pool utilization stays above or below configured bounds for a sustained period
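As a rough illustration, the four conditions above could be evaluated against a pool snapshot as follows. Every name here (`PoolSnapshot`, `AlertThresholds`, `evaluateAlerts`) is hypothetical, the threshold values are still an open question, and "consistently" is interpreted as "for the last N utilization samples", which is an assumption.

```typescript
// Hypothetical snapshot of the pool at evaluation time.
interface PoolSnapshot {
  available: number;
  failureRate: number;               // failures / attempts, 0..1
  longestActiveAllocationMs: number; // age of the oldest open allocation
  recentUtilization: number[];       // oldest first, one value per sample
}

// Hypothetical thresholds; actual values are not defined yet.
interface AlertThresholds {
  minAvailable: number;
  maxFailureRate: number;
  maxAllocationMs: number;
  highUtilization: number;
  lowUtilization: number;
  imbalanceSamples: number; // how many consecutive samples count as "consistent"
}

type AlertName = "pool_exhaustion" | "high_failure_rate" | "long_allocation" | "pool_imbalance";

function evaluateAlerts(s: PoolSnapshot, t: AlertThresholds): AlertName[] {
  const alerts: AlertName[] = [];
  if (s.available < t.minAvailable) alerts.push("pool_exhaustion");
  if (s.failureRate > t.maxFailureRate) alerts.push("high_failure_rate");
  if (s.longestActiveAllocationMs > t.maxAllocationMs) alerts.push("long_allocation");

  // Imbalance fires only when a full window of samples is all high or all low.
  const recent = s.recentUtilization.slice(-t.imbalanceSamples);
  const fullWindow = t.imbalanceSamples > 0 && recent.length === t.imbalanceSamples;
  const allHigh = recent.every((u) => u > t.highUtilization);
  const allLow = recent.every((u) => u < t.lowUtilization);
  if (fullWindow && (allHigh || allLow)) alerts.push("pool_imbalance");

  return alerts;
}
```

Returning alert names rather than sending notifications directly keeps the conditions testable in isolation from whichever alerting system is integrated later.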

4. Monitoring Dashboard

Create monitoring dashboard (future):

  • Real-time pool status: Current available/total gateways
  • Allocation history: Chart of allocations over time
  • Utilization trends: Pool utilization over time
  • Failure analysis: Breakdown of allocation failures
  • Alert history: Recent alerts and resolutions

Implementation Plan

Phase 1: Event Logging

  1. Add structured logging for allocation events
  2. Add structured logging for deallocation events
  3. Add structured logging for failures
  4. Integrate with centralized logging system

Phase 2: Metrics Collection

  1. Add metrics collection for pool status
  2. Add metrics for allocation rate
  3. Add metrics for allocation duration
  4. Add metrics for failure rate
  5. Export metrics to monitoring system

Phase 3: Alerting

  1. Define alert thresholds
  2. Implement alert conditions
  3. Integrate with alerting system
  4. Test alert triggers

Phase 4: Dashboard (Future)

  1. Design monitoring dashboard
  2. Create dashboard components
  3. Integrate with metrics
  4. Add real-time updates

Related Files

  • packages/app-ganymede/src/routes/gateway/index.ts - Gateway allocation/deallocation
  • packages/app-ganymede/src/services/ - Gateway services
  • doc/architecture/LOGGING_AND_OBSERVABILITY.md - Logging architecture

Acceptance Criteria

  • Allocation events logged to centralized logger
  • Deallocation events logged to centralized logger
  • Failure events logged with error details
  • Pool status metrics tracked
  • Allocation rate metrics tracked
  • Allocation duration metrics tracked
  • Failure rate metrics tracked
  • Alerts configured for pool exhaustion
  • Alerts configured for high failure rate
  • Metrics exported to monitoring system
  • Documentation for monitoring setup

Questions to Resolve

  1. Which monitoring system to use? (Prometheus? Datadog? Custom?)
  2. What alert thresholds should be used?
  3. Should we track per-organization metrics?
  4. Should we implement dashboard now or defer?
  5. How should we handle alert noise (rate limiting, deduplication)?
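For question 5, one simple option is a per-alert cooldown: the same alert key is suppressed until a fixed interval has passed since it was last sent. The sketch below only illustrates that idea. `AlertDeduplicator` and its cooldown parameter are hypothetical, and a real alerting system may already provide equivalent grouping.

```typescript
// Hypothetical in-memory deduplicator: one cooldown window per alert key.
class AlertDeduplicator {
  private readonly cooldownMs: number;
  private readonly lastSentAt = new Map<string, number>();

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if the alert should be sent now, and records the send time.
  // Returns false while the key is still inside its cooldown window.
  shouldSend(key: string, nowMs: number = Date.now()): boolean {
    const last = this.lastSentAt.get(key);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false;
    }
    this.lastSentAt.set(key, nowMs);
    return true;
  }
}
```

The key could be the alert name alone or the alert name combined with an organization, depending on how question 3 is resolved.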

Notes

  • This task is marked as "Deferred" in the TODO
  • Should be implemented as part of comprehensive monitoring system
  • May depend on logging infrastructure improvements
  • Consider implementing incrementally (logging first, then metrics, then alerts)

Metadata

  • Assignees: No one assigned
  • Labels: enhancement (New feature or request)
  • Type: No type
  • Projects: No projects
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests