feat(observability): integrate infrastructure service logs via OTLP Collector sidecars #29

@FL-AntoineDurand

Integrate Infrastructure Service Logs into Observability Stack

Purpose

Currently, only Node.js application logs (Ganymede, Gateway) are integrated into the observability stack via OTLP. Infrastructure service logs (PostgreSQL, Nginx, dnsmasq, PowerDNS, and gateway container services) are not yet collected, limiting our ability to correlate application issues with underlying infrastructure problems.

This issue tracks the implementation of OTLP Collector sidecars to collect and forward infrastructure service logs to the main observability stack (Loki via OTLP Collector).

Current State

✅ Already Integrated

  • Node.js applications (Ganymede, Gateway) send logs via OTLP
  • Logs are structured with trace context and service metadata
  • Viewable in Grafana via Loki datasource

❌ Not Yet Integrated

Dev Container Services:

  • PostgreSQL - Database logs
  • Nginx (Stage 1) - Web server access/error logs (SSL termination)
  • dnsmasq - DNS forwarder logs
  • PowerDNS - DNS authoritative server logs

Gateway Container Services:

  • Nginx (Stage 2) - Reverse proxy logs inside gateway containers
  • OpenVPN - VPN server logs inside gateway containers
  • app-gateway - Already sends logs via OTLP ✅

Architecture

The OTLP Collector runs in a sibling Docker container, not inside the dev container. This means we need log shipper agents (OTLP Collector sidecars) running inside:

  1. Dev container - To collect dev container service logs
  2. Gateway containers - To collect gateway container service logs

Both sidecars forward logs to the main OTLP Collector over OTLP/HTTP (port 4318, as used below) or OTLP/gRPC (default port 4317).
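Before either sidecar is wired up, it helps to confirm that the container can reach the main collector at all. A minimal sketch, assuming `curl` is installed and that the main collector's OTLP/HTTP receiver is at `observability-otlp-collector:4318` (the address used in Phase 1). It posts an empty OTLP/JSON logs request to the standard `/v1/logs` path, which a reachable collector should accept. The function name `check_collector_reachable` is hypothetical.

```shell
# Hypothetical helper: returns 0 if the main collector accepts an (empty)
# OTLP/HTTP logs request, non-zero otherwise.
# Usage: check_collector_reachable [base-url]
check_collector_reachable() {
  # Default endpoint taken from the Phase 1 network task; override as needed.
  local endpoint="${1:-http://observability-otlp-collector:4318}"
  # An empty resourceLogs array is a valid OTLP/JSON export request.
  # --fail turns HTTP 4xx/5xx into a non-zero exit code, and
  # --max-time stops the check from hanging on an unroutable address.
  curl --silent --show-error --fail --max-time 5 \
    -H 'Content-Type: application/json' \
    -d '{"resourceLogs":[]}' \
    "${endpoint%/}/v1/logs" > /dev/null
}
```

For example, `check_collector_reachable || echo "main collector not reachable" >&2` can run inside the dev container and, through `docker exec`, inside a gateway container. Any failure points at the Docker network setup rather than the sidecar config.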

Implementation Tasks

Phase 1: Dev Container Sidecar

  • Install OTLP Collector binary in dev container
    • Download and install otelcol-contrib binary
    • Add to PATH or install to /usr/local/bin/
  • Create sidecar configuration file
    • Location: /root/.local-dev/observability/collector-sidecar-config.yaml
    • Configure filelog receivers for:
      • Nginx access logs (/var/log/nginx/access.log)
      • Nginx error logs (/var/log/nginx/error.log)
      • PowerDNS logs (/var/log/pdns.log)
      • dnsmasq logs (/var/log/dnsmasq.log)
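    • Configure journald receiver for PostgreSQL (the receiver runs journalctl as a subprocess, so journalctl and a readable journal must be available in the container)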
    • Set OTLP exporter to forward to main collector
  • Enable service logging
    • Configure dnsmasq: log-queries and log-facility=/var/log/dnsmasq.log in /etc/dnsmasq.conf
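    • Configure PowerDNS: log-dns-queries=yes in /etc/powerdns/pdns.conf. PowerDNS has no log-file path setting (logging-facility only selects a syslog LOCAL facility), so /var/log/pdns.log must be produced via syslog or by redirecting PowerDNS stdout/stderr (e.g. with disable-syslog=yes)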
    • Verify Nginx logs to /var/log/nginx/ (already configured)
  • Create startup script/service
    • Script to start sidecar collector as background process
    • Ensure it starts automatically on dev container startup
    • Handle collector restarts on failure
  • Update network configuration
    • Ensure dev container can access main OTLP Collector
    • Use Docker network: http://observability-otlp-collector:4318
    • Or Docker bridge gateway IP if needed
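The Phase 1 configuration can be sketched as a shell function for the dev container setup script. This is a minimal sketch, not a tested config. It assumes Nginx uses its default `combined` access log format and default error log prefix, that the main collector's OTLP/HTTP receiver is at `observability-otlp-collector:4318`, and that the filelog `resource` map and the stanza `add` operator behave as in current otelcol-contrib releases. The function name and receiver names (`filelog/nginx_access` and so on) are illustrative. PowerDNS and dnsmasq lines are forwarded unparsed.

```shell
# Hypothetical helper: writes the dev container sidecar config from Phase 1.
# Usage: write_dev_sidecar_config [output-path]
write_dev_sidecar_config() {
  local out="${1:-/root/.local-dev/observability/collector-sidecar-config.yaml}"
  mkdir -p "$(dirname "$out")"
  # Quoted heredoc delimiter: the shell expands nothing in the YAML below.
  cat > "$out" <<'EOF'
receivers:
  filelog/nginx_access:
    include: [/var/log/nginx/access.log]
    start_at: end
    resource:
      service.name: nginx
    operators:
      # Assumes Nginx's default "combined" access log format.
      - type: regex_parser
        regex: '^(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
  filelog/nginx_error:
    include: [/var/log/nginx/error.log]
    start_at: end
    resource:
      service.name: nginx
    operators:
      # Default error log prefix: "YYYY/MM/DD HH:MM:SS [level] pid#tid: message".
      - type: regex_parser
        regex: '^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<pid>\d+)#(?P<tid>\d+): (?P<message>.*)$'
        severity:
          parse_from: attributes.level
          # Map Nginx level names onto OpenTelemetry severities.
          mapping:
            fatal: emerg
            error: [alert, crit, error]
            warn: warn
            info: [notice, info]
            debug: debug
  filelog/powerdns:
    include: [/var/log/pdns.log]
    start_at: end
    resource:
      service.name: powerdns
  filelog/dnsmasq:
    include: [/var/log/dnsmasq.log]
    start_at: end
    resource:
      service.name: dnsmasq
  journald/postgresql:
    units: [postgresql]
    priority: info
    operators:
      # Tag journal entries so Loki exposes them as service_name="postgresql".
      - type: add
        field: resource["service.name"]
        value: postgresql
processors:
  batch: {}
exporters:
  otlphttp:
    endpoint: http://observability-otlp-collector:4318
service:
  pipelines:
    logs:
      receivers: [filelog/nginx_access, filelog/nginx_error, filelog/powerdns, filelog/dnsmasq, journald/postgresql]
      processors: [batch]
      exporters: [otlphttp]
EOF
}
```

Recent collector releases provide a `validate` subcommand, so `otelcol-contrib validate --config=<path>` can catch YAML or component errors before the sidecar is started.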

Phase 2: Gateway Container Sidecar

  • Update gateway Dockerfile
    • Install OTLP Collector binary (otelcol-contrib)
    • Add collector config file to image: /opt/gateway/observability/collector-config.yaml
  • Create gateway sidecar configuration
    • Configure filelog receivers for:
      • Nginx access logs (/var/log/nginx/access.log)
      • Nginx error logs (/var/log/nginx/error.log)
      • OpenVPN logs (/tmp/ovpn-*/logs/openvpn.log) - wildcard pattern for multiple VPN instances
    • Add resource attributes: service.name, gateway_id, deployment.environment
    • Set OTLP exporter to forward to main collector
  • Update gateway entrypoint
    • Start collector sidecar after services are started
    • Run as background process: otelcol-contrib --config=/opt/gateway/observability/collector-config.yaml &
  • Update gateway container network configuration
    • Ensure gateway containers can access main OTLP Collector
    • Add gateway containers to observability-network OR
    • Use Docker bridge and access collector via host IP
  • Update gateway-pool.sh if needed
    • Ensure network configuration allows collector access
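A comparable sketch for the gateway image. It assumes the gateway containers join `observability-network`, so the collector hostname resolves; with the bridge/host-IP option, `endpoint` would change. It also assumes the entrypoint exports two hypothetical variables, `GATEWAY_ID` and `DEPLOYMENT_ENVIRONMENT`, which the collector substitutes through its `${env:...}` syntax at startup. The function could run at image build time to produce `/opt/gateway/observability/collector-config.yaml`. Nginx parsing operators are left out for brevity.

```shell
# Hypothetical helper: writes the gateway sidecar config from Phase 2.
# ${env:...} references stay literal here and are expanded by the collector.
# Usage: write_gateway_sidecar_config [output-path]
write_gateway_sidecar_config() {
  local out="${1:-/opt/gateway/observability/collector-config.yaml}"
  mkdir -p "$(dirname "$out")"
  cat > "$out" <<'EOF'
receivers:
  filelog/nginx:
    include: [/var/log/nginx/access.log, /var/log/nginx/error.log]
    start_at: end
    resource:
      service.name: gateway-nginx
  filelog/openvpn:
    # The glob matches every OpenVPN instance; files created later are found on the next poll.
    include: [/tmp/ovpn-*/logs/openvpn.log]
    start_at: end
    # Adds a log.file.path attribute so instances can be told apart.
    include_file_path: true
    resource:
      service.name: gateway-openvpn
processors:
  batch: {}
  resource:
    attributes:
      - key: gateway_id
        value: ${env:GATEWAY_ID}
        action: upsert
      - key: deployment.environment
        value: ${env:DEPLOYMENT_ENVIRONMENT}
        action: upsert
exporters:
  otlphttp:
    endpoint: http://observability-otlp-collector:4318
service:
  pipelines:
    logs:
      receivers: [filelog/nginx, filelog/openvpn]
      processors: [resource, batch]
      exporters: [otlphttp]
EOF
}
```

Putting `gateway_id` in a `resource` processor, rather than in each receiver, keeps it consistent across both log sources.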

Phase 3: Testing & Verification

  • Verify dev container sidecar is running
    • Check process: ps aux | grep '[o]telcol-contrib' (the bracket pattern keeps grep from matching its own process)
    • Check logs for errors
  • Verify gateway container sidecars are running
    • Check in each gateway container: docker exec <container> ps aux | grep otelcol-contrib
  • Verify logs appear in Grafana
    • Query: {service_name="nginx"}
    • Query: {service_name="dnsmasq"}
    • Query: {service_name="powerdns"}
    • Query: {service_name="postgresql"}
    • Query: {service_name="gateway-nginx"}
    • Query: {service_name="gateway-openvpn"}
  • Test log parsing
    • Verify Nginx access logs are parsed correctly (remote_addr, method, path, status, etc.)
    • Verify error logs include severity levels
    • Verify gateway logs include gateway_id attribute
  • Test log volume and performance
    • Monitor collector resource usage
    • Verify no performance degradation
    • Check log delivery latency

Phase 4: Documentation & Cleanup

  • Update observability setup documentation
    • Document sidecar configuration
    • Document log locations and formats
    • Add troubleshooting section
  • Create Grafana dashboard examples
    • Infrastructure service health dashboard
    • Log volume and error rate dashboards
  • Add log retention configuration
    • Configure Loki retention policies
    • Document storage impact estimates

Acceptance Criteria

  • All dev container service logs (PostgreSQL, Nginx, dnsmasq, PowerDNS) are visible in Grafana
  • All gateway container service logs (Nginx, OpenVPN) are visible in Grafana
  • Logs are properly labeled with service.name and other relevant attributes
  • Gateway logs include gateway_id for filtering per-gateway logs
  • Logs are parsed correctly (structured fields extracted)
  • Sidecars start automatically and handle failures gracefully
  • No performance degradation observed
  • Documentation is updated with setup and troubleshooting information
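The "start automatically and handle failures gracefully" criterion applies both to the dev container startup script and to the gateway entrypoint. A plain `otelcol-contrib --config=... &` starts the sidecar but never restarts it. One option is a small restart loop with backoff, sketched below under these assumptions: the `run_collector_supervised` name, the default log path and the `OTELCOL_BIN` override are hypothetical.

```shell
# Hypothetical helper: runs otelcol-contrib in the foreground and restarts it
# whenever it exits, waiting 1s, 2s, 4s, ... (capped at 30s) between runs.
# Usage: run_collector_supervised <config-path> [log-file] [max-runs]
#   max-runs defaults to 0, meaning "restart forever".
run_collector_supervised() {
  local config="$1"
  local log="${2:-/var/log/otelcol-sidecar.log}"
  local max_runs="${3:-0}"
  # OTELCOL_BIN is a hypothetical override (handy for testing).
  local bin="${OTELCOL_BIN:-otelcol-contrib}"
  local runs=0 delay=1 status
  while :; do
    status=0
    # "|| status=$?" records the exit code without tripping "set -e" in the caller.
    "$bin" --config="$config" >>"$log" 2>&1 || status=$?
    runs=$((runs + 1))
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $bin exited with status $status (run $runs)" >>"$log"
    if [ "$max_runs" -gt 0 ] && [ "$runs" -ge "$max_runs" ]; then
      return "$status"
    fi
    sleep "$delay"
    # Exponential backoff so a crash loop does not spin the CPU.
    delay=$((delay * 2))
    if [ "$delay" -gt 30 ]; then delay=30; fi
  done
}
```

In the gateway entrypoint this would replace the bare background call, e.g. `run_collector_supervised /opt/gateway/observability/collector-config.yaml &`. If a container already runs a process supervisor such as supervisord, s6 or systemd, letting it own the restart policy is probably simpler than a shell loop.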

Log Locations Reference

Dev Container:

  • Nginx: /var/log/nginx/access.log, /var/log/nginx/error.log
  • PowerDNS: /var/log/pdns.log (after configuration)
  • dnsmasq: /var/log/dnsmasq.log (after configuration)
  • PostgreSQL: Systemd journal (journalctl -u postgresql)
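The PowerDNS and dnsmasq paths above only exist once logging is enabled. A minimal sketch of the Phase 1 "Enable service logging" step, written to be idempotent so the setup script can be re-run. The helper names are hypothetical, and the services must be restarted afterwards for the settings to apply. dnsmasq treats a `log-facility` value containing `/` as a file name. PowerDNS has no equivalent file setting, so this only enables query logging there, and `/var/log/pdns.log` still has to come from syslog or an output redirect.

```shell
# Hypothetical helper: appends a config line only if an identical line is not
# already present in the file.
ensure_line() {
  local line="$1" file="$2"
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Hypothetical helper: enables query logging for dnsmasq and PowerDNS.
# Usage: enable_service_logging [dnsmasq-conf] [pdns-conf]
enable_service_logging() {
  local dnsmasq_conf="${1:-/etc/dnsmasq.conf}"
  local pdns_conf="${2:-/etc/powerdns/pdns.conf}"
  # dnsmasq: log every query, and write to a file instead of syslog.
  ensure_line 'log-queries' "$dnsmasq_conf"
  ensure_line 'log-facility=/var/log/dnsmasq.log' "$dnsmasq_conf"
  # PowerDNS: log incoming queries (destination still syslog/stdout).
  ensure_line 'log-dns-queries=yes' "$pdns_conf"
}
```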

Gateway Containers:

  • Nginx: /var/log/nginx/access.log, /var/log/nginx/error.log
  • OpenVPN: /tmp/ovpn-{random}/logs/openvpn.log (per VPN instance)

Related Documentation

  • See doc/guides/INFRASTRUCTURE_LOGS_OBSERVABILITY.md for detailed analysis and configuration examples
  • See scripts/local-dev/OBSERVABILITY_SETUP.md for observability stack setup

Notes

  • OTLP Collector sidecar approach is chosen for unified pipeline (all logs via OTLP)
  • Alternative solutions (Promtail, Fluent Bit) were considered but OTLP Collector provides better integration with existing stack
  • Log volume considerations: Nginx access logs can be high volume, consider sampling or filtering if needed
  • Gateway containers may run multiple OpenVPN instances; the sidecar must handle wildcard log paths

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request), infrastructure (Infrastructure and DevOps related)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests