Description
Currently, internal/gtfs/realtime.go suffers from severe lock contention under load, leading to thread exhaustion and high p99 latency for API readers.
The rebuildMergedRealtimeLocked function is called to rebuild the global routing maps (realTimeTripLookup and realTimeVehicleLookupByTrip). However, it performs this O(N) map allocation and data copying while holding an exclusive write lock (realTimeMutex.Lock()).
The Bottleneck
When processing a real-world GTFS-RT feed with tens of thousands of active trips, the map allocation takes several milliseconds. During this critical section:
- Every incoming HTTP API request attempting to read state (e.g., calling GetRealTimeTrips) blocks while waiting for the RLock.
- This creates a massive queue of blocked goroutines.
- When the write lock is finally released, the "thundering herd" of blocked readers wakes up simultaneously, thrashing the Go scheduler and causing cascading timeouts.
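The contended pattern described above can be sketched roughly as follows. This is a simplified illustration, not the actual code in internal/gtfs/realtime.go: the `lockedManager` type, its fields, the `rebuild` and `lookup` methods, and the `map[string]int` value type are all hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedManager is a hypothetical stand-in for the API manager.
type lockedManager struct {
	mu         sync.RWMutex
	tripLookup map[string]int // trip ID -> index (value type assumed)
}

// rebuild allocates and fills the new map while holding the exclusive
// write lock, so every reader waits for the full O(N) loop to finish.
func (m *lockedManager) rebuild(tripIDs []string) {
	m.mu.Lock()
	defer m.mu.Unlock()

	fresh := make(map[string]int, len(tripIDs))
	for i, id := range tripIDs {
		fresh[id] = i
	}
	m.tripLookup = fresh
}

// lookup is the reader path: it blocks on RLock for as long as a
// rebuild holds (or is queued for) the write lock.
func (m *lockedManager) lookup(id string) (int, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()

	idx, ok := m.tripLookup[id]
	return idx, ok
}

func main() {
	m := &lockedManager{tripLookup: map[string]int{}}
	m.rebuild([]string{"t1", "t2", "t3"})
	fmt.Println(m.lookup("t2"))
}
```

In this shape the length of the critical section grows with the size of the feed, which is what makes the reader queue build up.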
Note: This architectural bottleneck perfectly aligns with the 75%+ failure rate and 1-minute latency spikes documented in docs/mutex_contention_analysis.md.
Proposed Solution: Lock-Free Copy-On-Write (COW)
To eliminate the reader starvation, we should move the O(N) allocation entirely out of the critical section using a Copy-On-Write pattern with atomic.Value.
Implementation Steps:
- Group the lookup maps into a single state struct (e.g., RealTimeState).
- Store this struct in the API manager using atomic.Value.
- Inside the rebuild function, allocate and populate the new maps in local memory without acquiring the global lock.
- Once the new maps are fully built, perform an O(1) atomic pointer swap to make them active.
This ensures that API readers never block waiting for background feed processing: each read is a single uncontended atomic load followed by an ordinary map lookup, provided a published state is never mutated after the swap.
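A minimal sketch of the proposed pattern, under stated assumptions: `Manager`, `NewManager`, `Snapshot`, `Rebuild`, the field names, and the `map[string]int` value types are hypothetical and only illustrate the shape of the change. `RealTimeState` is the name suggested above. The sketch also assumes a single goroutine performs rebuilds; if several writers can run concurrently, they would still need a writer-only mutex to serialize them.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RealTimeState is an immutable snapshot of the lookup maps.
// Once stored, it must never be modified.
type RealTimeState struct {
	TripLookup          map[string]int // trip ID -> index (value type assumed)
	VehicleLookupByTrip map[string]int // trip ID -> index (value type assumed)
}

// Manager is a hypothetical stand-in for the API manager.
type Manager struct {
	state atomic.Value // always holds a *RealTimeState
}

// NewManager publishes an empty state so readers never see a nil value.
func NewManager() *Manager {
	m := &Manager{}
	m.state.Store(&RealTimeState{
		TripLookup:          map[string]int{},
		VehicleLookupByTrip: map[string]int{},
	})
	return m
}

// Snapshot is the reader path: one atomic load, no lock.
// Callers must treat the returned maps as read-only.
func (m *Manager) Snapshot() *RealTimeState {
	return m.state.Load().(*RealTimeState)
}

// Rebuild builds the new maps in local memory, then publishes them
// with a single atomic store. Readers keep using the old snapshot
// until the store completes.
func (m *Manager) Rebuild(tripIDs, vehicleTripIDs []string) {
	next := &RealTimeState{
		TripLookup:          make(map[string]int, len(tripIDs)),
		VehicleLookupByTrip: make(map[string]int, len(vehicleTripIDs)),
	}
	for i, id := range tripIDs {
		next.TripLookup[id] = i
	}
	for i, id := range vehicleTripIDs {
		next.VehicleLookupByTrip[id] = i
	}
	m.state.Store(next) // O(1) swap
}

func main() {
	m := NewManager()
	before := m.Snapshot()
	m.Rebuild([]string{"t1", "t2"}, []string{"t2"})
	after := m.Snapshot()
	fmt.Println(len(before.TripLookup), len(after.TripLookup))
}
```

A reader that loaded a snapshot before the swap keeps a consistent view of both maps, and the old snapshot is garbage collected once no reader references it. On Go 1.19 or later, the generic `atomic.Pointer[RealTimeState]` is an alternative that avoids the type assertion.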
@aaronbrethorst @Ahmedhossamdev
Should I work on this?