Skip to content

Latest commit

 

History

History
168 lines (121 loc) · 5.93 KB

File metadata and controls

168 lines (121 loc) · 5.93 KB

Thread-Safety Bug Fix Verification

Bug Summary

Issue: Middleware used instance variables (@session_id, @visitor_id, @request) shared across concurrent requests in multi-threaded servers (Puma). This caused race conditions where session/visitor IDs leaked between requests.

Impact: Pet Resorts Australia had 178,428 sessions for only 172,000 visitors. Some visitors had 1,500+ sessions because cookies were set with wrong session_ids under concurrent load.

Root Cause: Rack middleware is instantiated once and shared across all requests. Instance variables are not thread-safe.

Fix Details

Field Value
Fixed in version 0.6.3
Commit bdf4c64
Fix deployed 2025-12-22 ~21:30 UTC (2025-12-23 ~08:30 AEDT)
Gem published 2025-12-23

Verification Checklist

After 24-48 hours (by 2025-12-25):

Run these queries in production Rails console:

# 1. Check session creation rate AFTER fix
# Should see dramatically fewer sessions per hour
cutoff = Time.parse("2025-12-23 08:30:00 UTC")  # Adjust to actual deploy time

puts "Sessions BEFORE fix (last 24h before deploy):"
before_sessions = Session.where(created_at: (cutoff - 24.hours)..cutoff).count
puts "  Count: #{before_sessions}"

puts "\nSessions AFTER fix (24h after deploy):"
after_sessions = Session.where(created_at: cutoff..(cutoff + 24.hours)).count
puts "  Count: #{after_sessions}"

puts "\nReduction: #{((before_sessions - after_sessions).to_f / before_sessions * 100).round(1)}%"
# 2. Check sessions per visitor ratio
# Should be close to 1.0-1.5 for new visitors (was 1.05 overall but outliers had 1500+)
cutoff = Time.parse("2025-12-23 08:30:00 UTC")

new_visitors = Visitor.where("created_at > ?", cutoff)
new_visitor_ids = new_visitors.pluck(:id)

sessions_for_new = Session.where(visitor_id: new_visitor_ids).count
puts "New visitors since fix: #{new_visitors.count}"
puts "Sessions for new visitors: #{sessions_for_new}"
puts "Ratio: #{(sessions_for_new.to_f / new_visitors.count).round(2)}"
# 3. Check for any new outliers (visitors with 10+ sessions in 24h)
cutoff = Time.parse("2025-12-23 08:30:00 UTC")

outliers = Session.where("created_at > ?", cutoff)
  .group(:visitor_id)
  .having("count(*) > 10")
  .count

puts "Visitors with 10+ sessions since fix: #{outliers.count}"
outliers.sort_by { |_, v| -v }.first(5).each do |vid, count|
  puts "  Visitor #{vid}: #{count} sessions"
end

Expected Results After Fix:

  • Session creation rate drops by 90%+
  • Sessions per new visitor ratio < 2.0
  • No new outliers with 100+ sessions
  • Cookie session_id matches env session_id (verified by tests)

Other SDKs to Review

CRITICAL: Check all other SDKs for the same thread-safety bug!

SDK Review Checklist:

SDK Location Status Reviewed By Date
mbuzz-ruby /Users/vlad/code/m/mbuzz-ruby FIXED Claude 2025-12-22
mbuzz-python /Users/vlad/code/m/mbuzz-python SAFE Claude 2025-12-22
mbuzz-php /Users/vlad/code/m/mbuzz-php SAFE Claude 2025-12-22
mbuzz-node /Users/vlad/code/m/mbuzz-node SAFE Claude 2025-12-22

Review Results:

mbuzz-python: SAFE

  • Uses contextvars.ContextVar for thread-safe context storage
  • Uses Flask's g object for request-scoped storage
  • Local variables used throughout middleware
  • Async session creation captures values in local variables before spawning thread

mbuzz-php: SAFE

  • PHP is single-process per request by default
  • No shared state between requests
  • Each request gets fresh instance of everything

mbuzz-node: SAFE

  • Uses AsyncLocalStorage from node:async_hooks for async request isolation
  • Express middleware uses local variables (visitor, session, secure)
  • Attaches data to request-scoped req.mbuzz object
  • createSessionAsync captures values as function parameters before setImmediate
  • Node.js is single-threaded, so race conditions are inherently less likely

What to Look For:

  1. Middleware/Handler using instance variables or class variables for request-specific data

    • BAD: self.session_id = ... or @session_id = ...
    • GOOD: Local variables passed through function calls
  2. Mutable shared state

    • BAD: Global or class-level dicts/hashes storing request data
    • GOOD: Request-scoped context objects or local variables
  3. Thread-local storage without proper cleanup

    • Check that thread-local data is cleared after each request

Python-specific concerns:

  • Check for module-level variables
  • Check Flask/Django middleware for shared state
  • WSGI apps can have similar issues with global state

PHP-specific concerns:

  • PHP is typically single-threaded per request, so likely SAFE
  • But check for any persistent worker modes (Swoole, RoadRunner, FrankenPHP)

Node.js-specific concerns:

  • Node is single-threaded, so likely SAFE
  • But check for any shared state in closures or module scope

Data Cleanup (Optional)

After verifying the fix works, consider cleaning up the bad data:

# Find sessions with no events (likely created by the bug)
# BE CAREFUL - only run after thorough analysis

# Count empty sessions by account
Account.find_each do |account|
  session_ids_with_events = account.events.distinct.pluck(:session_id)
  empty_sessions = account.sessions.where.not(session_id: session_ids_with_events).count
  total_sessions = account.sessions.count

  next if empty_sessions == 0

  puts "#{account.name}: #{empty_sessions}/#{total_sessions} empty sessions (#{(empty_sessions.to_f/total_sessions*100).round(1)}%)"
end

Notes

  • Bug was discovered via dashboard metrics investigation (avg visits showing 28.5 with 0.6 avg days)
  • Traced to Pet Resorts Australia account (PetPro360)
  • Logs showed session creation every few seconds with different session_ids
  • Test added: test_race_condition_with_slow_app - 49/50 failures before fix, 0 after