feat(monitoring): add monitoring hooks for gateway lifecycle events #28

## Description

Add monitoring hooks for gateway lifecycle events. Implement centralized logging, metrics tracking, and alerting for gateway allocation, deallocation, utilization, and pool status to enable observability and proactive issue detection.

## Context

Currently, gateway lifecycle events are logged but not systematically monitored:
- Allocation and deallocation events are logged but not aggregated or tracked over time
- No metrics exist for gateway utilization
- No alerts fire on pool exhaustion
- No centralized monitoring dashboard exists
- Issues are difficult to diagnose and capacity needs are difficult to predict

**Note**: This task is marked as "Deferred" in the TODO, pending a comprehensive monitoring implementation. It is tracked here as a future enhancement.

## Requirements

### 1. Event Logging

Log gateway lifecycle events to the centralized logger:

- **Allocation events**: Log when a gateway is allocated (org, gateway_id, timestamp)
- **Deallocation events**: Log when a gateway is deallocated (org, gateway_id, timestamp, duration)
- **Allocation failures**: Log allocation failures with error details
- **Pool status**: Log pool status changes (available count, total count)
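The four event kinds above could be modeled as one structured record per event. The following is a minimal sketch only: the event names, the `GatewayEvent` type, and the `logGatewayEvent` and `deallocationEvent` helpers are hypothetical, and the sink defaults to `console.log` because the centralized logger's API is not specified in this issue.

```typescript
// Hypothetical event shapes; field names mirror the requirement list.
type GatewayEvent =
  | { type: "gateway.allocated"; org: string; gatewayId: string; timestamp: string }
  | { type: "gateway.deallocated"; org: string; gatewayId: string; timestamp: string; durationMs: number }
  | { type: "gateway.allocation_failed"; org: string; timestamp: string; error: string }
  | { type: "gateway.pool_status"; available: number; total: number; timestamp: string };

// A sink accepts one serialized log line. Swap in the real logger here.
type LogSink = (line: string) => void;

// Emit one JSON line per event so a log aggregator can parse the fields.
function logGatewayEvent(event: GatewayEvent, sink: LogSink = console.log): void {
  sink(JSON.stringify(event));
}

// Build a deallocation event, deriving the duration from the allocation time.
function deallocationEvent(
  org: string,
  gatewayId: string,
  allocatedAt: Date,
  now: Date = new Date(),
): GatewayEvent {
  return {
    type: "gateway.deallocated",
    org,
    gatewayId,
    timestamp: now.toISOString(),
    durationMs: now.getTime() - allocatedAt.getTime(),
  };
}
```

Keeping the event type as a discriminated union means a missing field (for example `durationMs` on a deallocation) is a compile-time error rather than a gap in the logs.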

### 2. Metrics Tracking

Track gateway utilization metrics:

- **Pool utilization**: Allocated gateways / total gateways, where allocated = total - available
- **Allocation rate**: Allocations per hour and per day
- **Average allocation duration**: How long gateways stay allocated
- **Failure rate**: Allocation failures / total allocation attempts
- **Pool exhaustion events**: Number of times the pool was exhausted
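As a rough illustration of how these metrics relate to each other, the sketch below keeps plain in-memory counters and derives the ratios on demand. The `GatewayPoolMetrics` class and its method names are hypothetical, and utilization is computed here as the allocated share of the pool, `(total - available) / total`. Allocation rate is not computed directly: the assumption is that the monitoring backend derives it by sampling the `allocations` counter over time.

```typescript
// Hypothetical in-memory collector; a real exporter would replace this.
class GatewayPoolMetrics {
  private attempts = 0;
  private failures = 0;
  private allocations = 0;
  private exhaustionEvents = 0;
  private completedAllocations = 0;
  private totalDurationMs = 0;

  // Record one allocation attempt and whether the pool was empty at the time.
  recordAllocation(succeeded: boolean, poolExhausted: boolean = false): void {
    this.attempts += 1;
    if (succeeded) this.allocations += 1;
    else this.failures += 1;
    if (poolExhausted) this.exhaustionEvents += 1;
  }

  // Record how long a gateway was held once it is released.
  recordDeallocation(durationMs: number): void {
    this.completedAllocations += 1;
    this.totalDurationMs += durationMs;
  }

  // Derive ratios from the counters; zero denominators yield 0 instead of NaN.
  snapshot(available: number, total: number) {
    return {
      utilization: total === 0 ? 0 : (total - available) / total,
      failureRate: this.attempts === 0 ? 0 : this.failures / this.attempts,
      averageDurationMs:
        this.completedAllocations === 0 ? 0 : this.totalDurationMs / this.completedAllocations,
      allocations: this.allocations,
      exhaustionEvents: this.exhaustionEvents,
    };
  }
}
```

Exposing monotonically increasing counters and leaving rates to the backend keeps the collector independent of whichever monitoring system is eventually chosen.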

### 3. Alerting

Set up alerts for critical conditions:

- **Pool exhaustion**: Alert when the number of available gateways falls below a threshold
- **High failure rate**: Alert when the allocation failure rate exceeds a threshold
- **Long allocation duration**: Alert on unusually long allocations
- **Pool imbalance**: Alert when pool utilization stays consistently high or low
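The threshold-based conditions above can be expressed as a pure function over the current pool state, which makes them easy to unit test before wiring them to an alerting system. This is a sketch under assumptions: the `AlertThresholds` and `PoolState` shapes, the alert names, and `evaluateAlerts` are all hypothetical, and the threshold values are left to the caller because this issue does not fix them yet. Pool imbalance is omitted because "consistently" requires a history of samples rather than a single state.

```typescript
// Hypothetical threshold configuration; actual values are an open question.
interface AlertThresholds {
  minAvailable: number;    // alert when available gateways drop below this
  maxFailureRate: number;  // alert when failure rate (0..1) exceeds this
  maxAllocationMs: number; // alert when an allocation is held longer than this
}

// Hypothetical view of the pool at one point in time.
interface PoolState {
  available: number;
  failureRate: number;
  longestActiveAllocationMs: number;
}

type Alert = "pool_exhaustion" | "high_failure_rate" | "long_allocation";

// Return every alert condition that currently holds, in a fixed order.
function evaluateAlerts(state: PoolState, thresholds: AlertThresholds): Alert[] {
  const alerts: Alert[] = [];
  if (state.available < thresholds.minAvailable) alerts.push("pool_exhaustion");
  if (state.failureRate > thresholds.maxFailureRate) alerts.push("high_failure_rate");
  if (state.longestActiveAllocationMs > thresholds.maxAllocationMs) alerts.push("long_allocation");
  return alerts;
}
```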

### 4. Monitoring Dashboard

Create a monitoring dashboard (future work):

- **Real-time pool status**: Current available/total gateways
- **Allocation history**: Chart of allocations over time
- **Utilization trends**: Pool utilization over time
- **Failure analysis**: Breakdown of allocation failures
- **Alert history**: Recent alerts and resolutions

## Implementation Plan

### Phase 1: Event Logging
1. Add structured logging for allocation events
2. Add structured logging for deallocation events
3. Add structured logging for failures
4. Integrate with centralized logging system

### Phase 2: Metrics Collection
1. Add metrics collection for pool status
2. Add metrics for allocation rate
3. Add metrics for allocation duration
4. Add metrics for failure rate
5. Export metrics to monitoring system

### Phase 3: Alerting
1. Define alert thresholds
2. Implement alert conditions
3. Integrate with alerting system
4. Test alert triggers

### Phase 4: Dashboard (Future)
1. Design monitoring dashboard
2. Create dashboard components
3. Integrate with metrics
4. Add real-time updates

## Related Files

- `packages/app-ganymede/src/routes/gateway/index.ts` - Gateway allocation/deallocation
- `packages/app-ganymede/src/services/` - Gateway services
- `doc/architecture/LOGGING_AND_OBSERVABILITY.md` - Logging architecture

## Acceptance Criteria

- [ ] Allocation events logged to centralized logger
- [ ] Deallocation events logged to centralized logger
- [ ] Failure events logged with error details
- [ ] Pool status metrics tracked
- [ ] Allocation rate metrics tracked
- [ ] Allocation duration metrics tracked
- [ ] Failure rate metrics tracked
- [ ] Alerts configured for pool exhaustion
- [ ] Alerts configured for high failure rate
- [ ] Metrics exported to monitoring system
- [ ] Documentation for monitoring setup

## Questions to Resolve

1. Which monitoring system to use? (Prometheus? Datadog? Custom?)
2. What alert thresholds should be used?
3. Should we track per-organization metrics?
4. Should we implement dashboard now or defer?
5. How should we handle alert noise (rate limiting, deduplication)?
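For question 5, one possible approach (not a decision) is a per-alert cooldown: the same alert key is suppressed until a fixed interval has passed since it was last sent. The `AlertDeduplicator` class below is a hypothetical sketch of that idea, and the cooldown length is an assumption left to the caller.

```typescript
// Hypothetical cooldown-based deduplicator for outgoing alerts.
class AlertDeduplicator {
  private readonly cooldownMs: number;
  private readonly lastSentAt = new Map<string, number>();

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if the alert should be sent now, and records the send time.
  // Repeats of the same key inside the cooldown window return false.
  shouldSend(key: string, nowMs: number = Date.now()): boolean {
    const last = this.lastSentAt.get(key);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false;
    }
    this.lastSentAt.set(key, nowMs);
    return true;
  }
}
```

A key such as `pool_exhaustion` would suppress repeats globally, while a key that includes the organization would deduplicate per organization, which ties into question 3.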

## Notes

- This task is marked as "Deferred" in the TODO
- It should be implemented as part of a comprehensive monitoring system
- It may depend on logging infrastructure improvements
- Consider implementing it incrementally (logging first, then metrics, then alerts)


