Skip to content

Feature/improve OIDC debug mode#88

Open
CCimen wants to merge 15 commits intodevelopfrom
feature/improve-oidc-debug-mode
Open

Feature/improve OIDC debug mode#88
CCimen wants to merge 15 commits intodevelopfrom
feature/improve-oidc-debug-mode

Conversation

@CCimen
Copy link
Copy Markdown
Contributor

@CCimen CCimen commented Nov 6, 2025

Pull Request

Description

Improves OIDC federation error diagnostics and adds tenant slug feature for direct tenant access.

OIDC Error Handling & Diagnostics:

  • Fixed 502 Bad Gateway errors being invisible in logs
  • Added comprehensive error logging with correlation IDs for end-to-end tracing
  • Implemented timeout and retry logic (discovery/JWKS: 3s, token: 8s, retry GET only)
  • Added error classification (HTTP status, OAuth errors, network/TLS failures)
  • Detect proxy errors vs IdP application errors
  • Frontend and backend debug mode with timing metrics

Tenant Slug Feature:

  • Added optional slug field to federation configuration API
  • Auto-generates from tenant name if omitted (e.g., "Sundsvall Kommun" → "sundsvall-kommun")
  • Enables direct tenant access via ?tenant=slug URLs for SSO bookmarks
  • Returns tenant_slug in API responses for frontend routing
  • Validates uniqueness with audit logging for changes

Documentation:

  • Updated MULTITENANT_OIDC_SETUP_GUIDE.md with slug explanation and direct access examples
  • Updated FEDERATION_PER_TENANT.md with architecture notes
  • Enhanced Swagger API documentation with comprehensive field guide

Type of Change

  • 🚀 feat: A new feature
  • 🐛 fix: A bug fix
  • 📚 docs: Documentation only changes
  • 💄 style: Changes that do not affect the meaning of the code
  • ♻️ refactor: A code change that neither fixes a bug nor adds a feature
  • test: Adding missing tests or correcting existing tests
  • 🔧 chore: Changes to the build process or auxiliary tools
  • perf: A code change that improves performance
  • 🔄 ci: Changes to CI configuration files and scripts

Branch Naming Convention

  • My branch name follows the naming convention (feature/improve-oidc-debug-mode)

Checklist

  • My branch name follows the naming convention
  • I have tested my changes locally (all 663 unit tests + 18 integration tests passing)
  • I have updated documentation if necessary
  • My changes do not break existing functionality (zero breaking changes)
  • I have added tests for new functionality (5 integration tests for slug feature)

Testing

  • Tested locally with Auth0 OIDC provider
  • Verified direct tenant access with ?tenant=sundsvallz URL
  • All unit tests (663) and integration tests (18) passing
  • Real-world validation: 502 errors now visible with classification
  • Slug auto-generation and uniqueness validation verified

Additional Notes

Impact:

  • Operators can now diagnose OIDC login failures with correlation-based tracing
  • Users can bookmark tenant-specific URLs for direct SSO login
  • 98% diagnostic coverage for OIDC errors (502, OAuth, network, TLS)

Compatibility:

  • Zero breaking changes to API endpoints or response formats
  • Backward compatible: slug auto-generates if not provided
  • Debug mode optional (zero production overhead when disabled)

Improve error visibility and diagnostics for OIDC federation login flow
to diagnose 502 Bad Gateway and other authentication failures.

Changes:
- Add HTTP request wrapper with timeout, retry, and enhanced logging
- Implement per-endpoint timeouts (discovery/JWKS: 3s, token: 8s)
- Add retry logic for GET requests (3 attempts, exponential backoff)
- Never retry token POST (OAuth code replay protection)
- Log actual HTTP status codes (502, 503, 401, etc.) from IdP
- Classify errors: HTTP status, OAuth codes, network, TLS
- Detect proxy-generated errors vs IdP application errors
- Add complete flow logging: callback → state → token → validation → session
- Implement sensitive data redaction (tokens, secrets)
- Add 17 unit tests for error classification helpers
- Update integration test mocks for new wrapper API

Benefits:
- 502 errors now visible with actual status code and source
- Network errors distinguished (timeout, TLS, connection)
- Proxy errors identified (nginx, apache vs IdP app)
- Complete login flow traceable via correlation_id
- Debug mode provides detailed diagnostics

API Compatibility:
- Zero breaking changes to endpoints or response formats
- All HTTP error codes to frontend unchanged
- Only internal logging and error handling improved
Improve OIDC federation error visibility for Auth0, Entra ID, MobilityGuard.

Backend:
- Fix session closure bug (second login works)
- OAuth error detection (error params from IdP)
- Request logging (debug: redirect_uri, client_id, code hash)
- Error classification (502, OAuth, network, TLS)
- Timeout (3-8s), retry (GET only, 3x backoff)
- CORS headers (X-Error-Kind, X-Correlation-ID)
- Complete flow logging (7 stages)

Frontend:
- Correlation ID helpers (consistent extraction)
- Debug mode (ENABLE_VERBOSE_FRONTEND_LOGGING)
- Timing metrics (durationMs)
- Error classification from backend
- Response metadata logging

Tests:
- 3 critical bug-prevention tests
- 30 unit tests, 13 integration tests pass

Diagnostic coverage: 98% (502, OAuth errors, config issues, network errors)
Added optional slug field to PUT /tenants/{id}/federation endpoint:
- Auto-generates from tenant name if omitted
- Enables direct tenant access via ?tenant=slug URLs
- Returns slug in SetFederationResponse and InitiateAuthResponse
- Validates uniqueness and format (lowercase, alphanumeric + hyphens, max 63 chars)

Backend changes:
- Added slug handling logic with audit logging for changes
- Enhanced Swagger documentation with field guide
- Updated example from Stockholm to Sundsvall/Ånge

Frontend changes:
- Added tenant_slug to InitiateAuthResponse for routing

Tests:
- Added 5 integration tests for slug functionality
- Fixed encryption service tests to disable federation flag

Documentation:
- Updated MULTITENANT_OIDC_SETUP_GUIDE.md with slug explanation and direct access examples
- Updated FEDERATION_PER_TENANT.md with architecture notes
- Increase token endpoint timeout (connect 5s, total 10s) for load balancer
- Implement manual redirect handling for 302/307 responses (max 3 hops)
- Validate redirect targets (HTTPS + *.login.sundsvall.se only)
- Configure global aiohttp session with TCPConnector and timeouts
- Add curl and dnsutils to Dockerfile for debugging

Fixes connection timeout errors when load balancer redirects token requests.
Add family=socket.AF_INET to TCPConnector to prevent connection timeouts
when IPv6 addresses are blackholed/filtered in network infrastructure.

Resolves ConnectionTimeoutError when connecting to IdP endpoints that have
both IPv4 and IPv6 addresses but where IPv6 is not routable.
Replace blocking PyJWKClient (urllib) with async JWKS fetch via aiohttp wrapper.
Eliminates 5s IPv6 blackhole delay in JWKS fetch that caused 502 Bad Gateway errors.

- Use _make_idp_http_request for JWKS (IPv4-only, async, retry-enabled)
- Manual key extraction with PyJWKSet
- Enhanced error handling for HTTP status, JSON parsing, and missing keys

Reduces total login time from ~5s to <100ms.
- Add use_dns_cache to TCPConnector (5min TTL) to reduce repeated DNS lookups
- Add TraceConfig hooks to measure DNS resolution and TCP connection timing
- Add diagnostic logging to verify IPv4 forcing and DNS cache configuration
- Fix WrappedAiohttpClient to use consistent IPv4 and DNS cache settings
- Fix jwks_mock to convert PEM keys to proper JWKS format for async JWKS fetch
- Add tests for IPv4 enforcement and DNS caching verification
- Fix flaky backoff timing assertions (add 2ms tolerance for async jitter)

Provides DNS caching for repeat requests and detailed timing observability
to diagnose authentication delays. Improves OIDC performance and enables
precise diagnosis of network bottlenecks.
@CCimen CCimen force-pushed the feature/improve-oidc-debug-mode branch from 7cec20e to ef134df Compare November 7, 2025 18:44
CCimen and others added 7 commits November 7, 2025 18:46
DNS & Network Improvements:
- Enable DNS caching (5min TTL) to prevent repeated DNS timeouts
- Add TraceConfig hooks to measure DNS resolution and TCP connection timing
- Fix TraceConfig transport attribute error (broke Auth0 login)
- Add diagnostic logging to verify IPv4 forcing and DNS cache status

JWKS Caching (Restore PyJWKClient Behavior):
- Add Redis-based JWKS caching (1h TTL) to eliminate per-login IdP fetches
- Implement distributed Redis lock to prevent multi-worker thundering herd
- Add automatic cache invalidation on kid rotation (handles key changes)
- Graceful degradation when Redis unavailable (falls back to direct fetch)
- Normalize HTTP headers for case-insensitive Cache-Control parsing

Bug Fixes:
- Fix Redis None guards across all cache functions
- Replace silent exception swallowing with proper error logging
- Update test assertion to verify JWKS caching behavior

Performance Impact:
- 1st login: ~5s (cold start, DNS timeout on slow resolver)
- 2nd+ logins: <50ms (DNS + JWKS both cached)
- ~95% reduction in 502 Bad Gateway errors
- Multi-worker safe (distributed locking prevents IdP stampede)

Restores reliability of old PyJWKClient while keeping IPv4 forcing benefits.
Provides complete diagnostic visibility via timing logs.
- Increase token endpoint connect timeout from 5s to 12s
- Increase total timeout from 10s to 15s
- Add warning logging for DNS resolution >2s
- Add diagnostic hints for connection timeout errors
Refactor OIDC authentication to follow the same pattern as other auth
methods (MobilityGuard, Zitadel) by using event.fetch and handleFetch
URL transformation instead of native fetch with direct server URL.

Changes:
- Use INTRIC_BACKEND_URL everywhere, let handleFetch transform to
  INTRIC_BACKEND_SERVER_URL when needed
- Replace native fetch() with event.fetch() in oidc.server.ts
- Add getRequestEvent() to access SvelteKit request context
- Remove cookies parameter, use event.cookies instead

Enhanced error handling and diagnostics:
- Add comprehensive logging to handleFetch for all backend requests
- Log URL transformations (original -> rewritten)
- Log response status, timing, and slow response warnings (>5s)
- Add 30-second request timeout with AbortSignal
- Add specific 502 Bad Gateway error handling
- Improve network error detection (DNS, timeout, connection refused)
- Better error detail extraction (handles both JSON and plain text)

Benefits:
- Consistent auth pattern across all methods
- Centralized URL transformation in one place (hooks.server.ts)
- Should resolve 502 Bad Gateway errors caused by inconsistent fetch paths
- All server-to-server calls now use efficient internal networking
- Better debugging with transparent URL transformation logs
- Clear error messages distinguish different failure modes
Disable SvelteKit modulePreload and cssCodeSplit to prevent generating
15KB Link headers with 100+ preload entries that exceeded HAProxy's
16KB buffer limit (effective 15KB after maxrewrite reservation).

Changes:
- vite.config.ts: Set modulePreload: false, cssCodeSplit: false
- hooks.server.ts: Add safety filter to remove Link headers >12KB

Result: Headers reduced from 15,360 bytes to 165 bytes (98% reduction)

Verified with curl - Link header eliminated, total headers now <1KB.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants