Skip to content

Conversation

@ofir-frd
Copy link

Benchmark PR Significant-Gravitas#11326

Type: Clean (correct implementation)

Original PR Title: hotfix(backend): fix rate-limited messages blocking queue by republishing to back
Original PR Description: ## Summary
Fix critical queue blocking issue where rate-limited user messages prevent other users' executions from being processed, causing the 135 late executions reported in production.

Root Cause Analysis

When a user exceeds max_concurrent_graph_executions_per_user (25), the executor uses basic_nack(requeue=True) which sends the message to the FRONT of the RabbitMQ queue. This creates an infinite blocking loop where:

  1. Rate-limited message goes to front of queue
  2. Gets processed, hits rate limit again
  3. Goes back to front of queue
  4. Blocks all other users' messages indefinitely

Solution Implementation

🔧 Core Changes

  • New setting: requeue_by_republishing (default: True) in backend/util/settings.py
  • Smart _ack_message: Automatically uses republishing when requeue=True and setting enabled
  • Efficient implementation: Uses existing self.run_client connection instead of creating new ones
  • Integration test: Real RabbitMQ test validates queue ordering behavior

🔄 Technical Implementation

Before (blocking):

basic_nack(delivery_tag, requeue=True)  # Goes to FRONT of queue ❌

After (non-blocking):

if requeue and self.config.requeue_by_republishing:
    # First: Republish to BACK of queue
    self.run_client.publish_message(...)
    # Then: Reject without requeue
    basic_nack(delivery_tag, requeue=False)

📊 Impact

  • Other users' executions no longer blocked by rate-limited users
  • Fair queue processing - FIFO behavior maintained for all users
  • Rate limiting still works - just doesn't block others
  • Configurable - can revert to old behavior with requeue_by_republishing=False
  • Zero performance impact - uses existing connections

Test Plan

  • Integration test: test_requeue_integration.py validates real RabbitMQ queue ordering
  • Scenario testing: Confirms rate-limited messages go to back of queue
  • Cross-user validation: Verifies other users' messages process correctly
  • Setting test: Confirms configuration loads with correct defaults

Deployment Strategy

This is a hotfix that can be deployed immediately:

  • Backward compatible: Old behavior available via config
  • Safe default: New behavior is safer than current state
  • No breaking changes: All existing functionality preserved
  • Immediate relief: Resolves production queue blocking

Files Modified

  • backend/executor/manager.py: Enhanced _ack_message logic and _requeue_message_to_back method
  • backend/util/settings.py: Added requeue_by_republishing configuration field
  • test_requeue_integration.py: Integration test for queue ordering validation

Related Issues

Fixes the 135 late executions issue where messages were stuck in QUEUED state despite available executor capacity (583m/600m utilization).

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com
Original PR URL: Significant-Gravitas#11326

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants