Open
Conversation
Improve error visibility and diagnostics for OIDC federation login flow to diagnose 502 Bad Gateway and other authentication failures. Changes: - Add HTTP request wrapper with timeout, retry, and enhanced logging - Implement per-endpoint timeouts (discovery/JWKS: 3s, token: 8s) - Add retry logic for GET requests (3 attempts, exponential backoff) - Never retry token POST (OAuth code replay protection) - Log actual HTTP status codes (502, 503, 401, etc.) from IdP - Classify errors: HTTP status, OAuth codes, network, TLS - Detect proxy-generated errors vs IdP application errors - Add complete flow logging: callback → state → token → validation → session - Implement sensitive data redaction (tokens, secrets) - Add 17 unit tests for error classification helpers - Update integration test mocks for new wrapper API Benefits: - 502 errors now visible with actual status code and source - Network errors distinguished (timeout, TLS, connection) - Proxy errors identified (nginx, apache vs IdP app) - Complete login flow traceable via correlation_id - Debug mode provides detailed diagnostics API Compatibility: - Zero breaking changes to endpoints or response formats - All HTTP error codes to frontend unchanged - Only internal logging and error handling improved
Improve OIDC federation error visibility for Auth0, Entra ID, MobilityGuard. Backend: - Fix session closure bug (second login works) - OAuth error detection (error params from IdP) - Request logging (debug: redirect_uri, client_id, code hash) - Error classification (502, OAuth, network, TLS) - Timeout (3-8s), retry (GET only, 3x backoff) - CORS headers (X-Error-Kind, X-Correlation-ID) - Complete flow logging (7 stages) Frontend: - Correlation ID helpers (consistent extraction) - Debug mode (ENABLE_VERBOSE_FRONTEND_LOGGING) - Timing metrics (durationMs) - Error classification from backend - Response metadata logging Tests: - 3 critical bug-prevention tests - 30 unit tests, 13 integration tests pass Diagnostic coverage: 98% (502, OAuth errors, config issues, network errors)
Added optional slug field to PUT /tenants/{id}/federation endpoint:
- Auto-generates from tenant name if omitted
- Enables direct tenant access via ?tenant=slug URLs
- Returns slug in SetFederationResponse and InitiateAuthResponse
- Validates uniqueness and format (lowercase, alphanumeric + hyphens, max 63 chars)
Backend changes:
- Added slug handling logic with audit logging for changes
- Enhanced Swagger documentation with field guide
- Updated example from Stockholm to Sundsvall/Ånge
Frontend changes:
- Added tenant_slug to InitiateAuthResponse for routing
Tests:
- Added 5 integration tests for slug functionality
- Fixed encryption service tests to disable federation flag
Documentation:
- Updated MULTITENANT_OIDC_SETUP_GUIDE.md with slug explanation and direct access examples
- Updated FEDERATION_PER_TENANT.md with architecture notes
- Increase token endpoint timeout (connect 5s, total 10s) for load balancer - Implement manual redirect handling for 302/307 responses (max 3 hops) - Validate redirect targets (HTTPS + *.login.sundsvall.se only) - Configure global aiohttp session with TCPConnector and timeouts - Add curl and dnsutils to Dockerfile for debugging Fixes connection timeout errors when load balancer redirects token requests.
Add family=socket.AF_INET to TCPConnector to prevent connection timeouts when IPv6 addresses are blackholed/filtered in network infrastructure. Resolves ConnectionTimeoutError when connecting to IdP endpoints that have both IPv4 and IPv6 addresses but where IPv6 is not routable.
Replace blocking PyJWKClient (urllib) with async JWKS fetch via aiohttp wrapper. Eliminates 5s IPv6 blackhole delay in JWKS fetch that caused 502 Bad Gateway errors. - Use _make_idp_http_request for JWKS (IPv4-only, async, retry-enabled) - Manual key extraction with PyJWKSet - Enhanced error handling for HTTP status, JSON parsing, and missing keys Reduces total login time from ~5s to <100ms.
- Add use_dns_cache to TCPConnector (5min TTL) to reduce repeated DNS lookups - Add TraceConfig hooks to measure DNS resolution and TCP connection timing - Add diagnostic logging to verify IPv4 forcing and DNS cache configuration - Fix WrappedAiohttpClient to use consistent IPv4 and DNS cache settings - Fix jwks_mock to convert PEM keys to proper JWKS format for async JWKS fetch - Add tests for IPv4 enforcement and DNS caching verification - Fix flaky backoff timing assertions (add 2ms tolerance for async jitter) Provides DNS caching for repeat requests and detailed timing observability to diagnose authentication delays. Improves OIDC performance and enables precise diagnosis of network bottlenecks.
7cec20e to
ef134df
Compare
DNS & Network Improvements: - Enable DNS caching (5min TTL) to prevent repeated DNS timeouts - Add TraceConfig hooks to measure DNS resolution and TCP connection timing - Fix TraceConfig transport attribute error (broke Auth0 login) - Add diagnostic logging to verify IPv4 forcing and DNS cache status JWKS Caching (Restore PyJWKClient Behavior): - Add Redis-based JWKS caching (1h TTL) to eliminate per-login IdP fetches - Implement distributed Redis lock to prevent multi-worker thundering herd - Add automatic cache invalidation on kid rotation (handles key changes) - Graceful degradation when Redis unavailable (falls back to direct fetch) - Normalize HTTP headers for case-insensitive Cache-Control parsing Bug Fixes: - Fix Redis None guards across all cache functions - Replace silent exception swallowing with proper error logging - Update test assertion to verify JWKS caching behavior Performance Impact: - 1st login: ~5s (cold start, DNS timeout on slow resolver) - 2nd+ logins: <50ms (DNS + JWKS both cached) - ~95% reduction in 502 Bad Gateway errors - Multi-worker safe (distributed locking prevents IdP stampede) Restores reliability of old PyJWKClient while keeping IPv4 forcing benefits. Provides complete diagnostic visibility via timing logs.
- Increase token endpoint connect timeout from 5s to 12s - Increase total timeout from 10s to 15s - Add warning logging for DNS resolution >2s - Add diagnostic hints for connection timeout errors
Refactor OIDC authentication to follow the same pattern as other auth methods (MobilityGuard, Zitadel) by using event.fetch and handleFetch URL transformation instead of native fetch with direct server URL. Changes: - Use INTRIC_BACKEND_URL everywhere, let handleFetch transform to INTRIC_BACKEND_SERVER_URL when needed - Replace native fetch() with event.fetch() in oidc.server.ts - Add getRequestEvent() to access SvelteKit request context - Remove cookies parameter, use event.cookies instead Enhanced error handling and diagnostics: - Add comprehensive logging to handleFetch for all backend requests - Log URL transformations (original -> rewritten) - Log response status, timing, and slow response warnings (>5s) - Add 30-second request timeout with AbortSignal - Add specific 502 Bad Gateway error handling - Improve network error detection (DNS, timeout, connection refused) - Better error detail extraction (handles both JSON and plain text) Benefits: - Consistent auth pattern across all methods - Centralized URL transformation in one place (hooks.server.ts) - Should resolve 502 Bad Gateway errors caused by inconsistent fetch paths - All server-to-server calls now use efficient internal networking - Better debugging with transparent URL transformation logs - Clear error messages distinguish different failure modes
Disable SvelteKit modulePreload and cssCodeSplit to prevent generating 15KB Link headers with 100+ preload entries that exceeded HAProxy's 16KB buffer limit (effective 15KB after maxrewrite reservation). Changes: - vite.config.ts: Set modulePreload: false, cssCodeSplit: false - hooks.server.ts: Add safety filter to remove Link headers >12KB Result: Headers reduced from 15,360 bytes to 165 bytes (98% reduction) Verified with curl - Link header eliminated, total headers now <1KB.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request
Description
Improves OIDC federation error diagnostics and adds tenant slug feature for direct tenant access.
OIDC Error Handling & Diagnostics:
Tenant Slug Feature:
slugfield to federation configuration API?tenant=slugURLs for SSO bookmarkstenant_slugin API responses for frontend routingDocumentation:
Type of Change
Branch Naming Convention
feature/improve-oidc-debug-mode)Checklist
Testing
?tenant=sundsvallzURLAdditional Notes
Impact:
Compatibility: