Issue: Middleware used instance variables (@session_id, @visitor_id, @request) shared across concurrent requests in multi-threaded servers (Puma). This caused race conditions where session/visitor IDs leaked between requests.
Impact: Pet Resorts Australia had 178,428 sessions for only 172,000 visitors. Some visitors had 1,500+ sessions because cookies were set with wrong session_ids under concurrent load.
Root Cause: Rack middleware is instantiated once and shared across all requests. Instance variables are not thread-safe.
| Field | Value |
|---|---|
| Fixed in version | 0.6.3 |
| Commit | bdf4c64 |
| Fix deployed | 2025-12-22 ~21:30 UTC (2025-12-23 ~08:30 AEDT) |
| Gem published | 2025-12-23 |
Run these queries in production Rails console:
# 1. Check session creation rate AFTER fix
# Should see dramatically fewer sessions per hour
cutoff = Time.parse("2025-12-23 08:30:00 UTC") # Adjust to actual deploy time
puts "Sessions BEFORE fix (last 24h before deploy):"
before_sessions = Session.where(created_at: (cutoff - 24.hours)..cutoff).count
puts " Count: #{before_sessions}"
puts "\nSessions AFTER fix (24h after deploy):"
after_sessions = Session.where(created_at: cutoff..(cutoff + 24.hours)).count
puts " Count: #{after_sessions}"
puts "\nReduction: #{((before_sessions - after_sessions).to_f / before_sessions * 100).round(1)}%"# 2. Check sessions per visitor ratio
# Should be close to 1.0-1.5 for new visitors (was 1.05 overall but outliers had 1500+)
cutoff = Time.parse("2025-12-23 08:30:00 UTC")
new_visitors = Visitor.where("created_at > ?", cutoff)
new_visitor_ids = new_visitors.pluck(:id)
sessions_for_new = Session.where(visitor_id: new_visitor_ids).count
puts "New visitors since fix: #{new_visitors.count}"
puts "Sessions for new visitors: #{sessions_for_new}"
puts "Ratio: #{(sessions_for_new.to_f / new_visitors.count).round(2)}"# 3. Check for any new outliers (visitors with 10+ sessions in 24h)
cutoff = Time.parse("2025-12-23 08:30:00 UTC")
outliers = Session.where("created_at > ?", cutoff)
.group(:visitor_id)
.having("count(*) > 10")
.count
puts "Visitors with 10+ sessions since fix: #{outliers.count}"
outliers.sort_by { |_, v| -v }.first(5).each do |vid, count|
puts " Visitor #{vid}: #{count} sessions"
end- Session creation rate drops by 90%+
- Sessions per new visitor ratio < 2.0
- No new outliers with 100+ sessions
- Cookie session_id matches env session_id (verified by tests)
CRITICAL: Check all other SDKs for the same thread-safety bug!
| SDK | Location | Status | Reviewed By | Date |
|---|---|---|---|---|
| mbuzz-ruby | /Users/vlad/code/m/mbuzz-ruby |
FIXED | Claude | 2025-12-22 |
| mbuzz-python | /Users/vlad/code/m/mbuzz-python |
SAFE | Claude | 2025-12-22 |
| mbuzz-php | /Users/vlad/code/m/mbuzz-php |
SAFE | Claude | 2025-12-22 |
| mbuzz-node | /Users/vlad/code/m/mbuzz-node |
SAFE | Claude | 2025-12-22 |
mbuzz-python: SAFE
- Uses
contextvars.ContextVarfor thread-safe context storage - Uses Flask's
gobject for request-scoped storage - Local variables used throughout middleware
- Async session creation captures values in local variables before spawning thread
mbuzz-php: SAFE
- PHP is single-process per request by default
- No shared state between requests
- Each request gets fresh instance of everything
mbuzz-node: SAFE
- Uses
AsyncLocalStoragefromnode:async_hooksfor async request isolation - Express middleware uses local variables (
visitor,session,secure) - Attaches data to request-scoped
req.mbuzzobject createSessionAsynccaptures values as function parameters beforesetImmediate- Node.js is single-threaded, so race conditions are inherently less likely
-
Middleware/Handler using instance variables or class variables for request-specific data
- BAD:
self.session_id = ...or@session_id = ... - GOOD: Local variables passed through function calls
- BAD:
-
Mutable shared state
- BAD: Global or class-level dicts/hashes storing request data
- GOOD: Request-scoped context objects or local variables
-
Thread-local storage without proper cleanup
- Check that thread-local data is cleared after each request
- Check for module-level variables
- Check Flask/Django middleware for shared state
- WSGI apps can have similar issues with global state
- PHP is typically single-threaded per request, so likely SAFE
- But check for any persistent worker modes (Swoole, RoadRunner, FrankenPHP)
- Node is single-threaded, so likely SAFE
- But check for any shared state in closures or module scope
After verifying the fix works, consider cleaning up the bad data:
# Find sessions with no events (likely created by the bug)
# BE CAREFUL - only run after thorough analysis
# Count empty sessions by account
Account.find_each do |account|
session_ids_with_events = account.events.distinct.pluck(:session_id)
empty_sessions = account.sessions.where.not(session_id: session_ids_with_events).count
total_sessions = account.sessions.count
next if empty_sessions == 0
puts "#{account.name}: #{empty_sessions}/#{total_sessions} empty sessions (#{(empty_sessions.to_f/total_sessions*100).round(1)}%)"
end- Bug was discovered via dashboard metrics investigation (avg visits showing 28.5 with 0.6 avg days)
- Traced to Pet Resorts Australia account (PetPro360)
- Logs showed session creation every few seconds with different session_ids
- Test added:
test_race_condition_with_slow_app- 49/50 failures before fix, 0 after