fix: PSv2 follow-up fixes from integration tests #1135
Draft
- Add connect_timeout=5, allow_reconnect=False to NATS connections to prevent leaked reconnection loops from blocking Django's event loop
- Guard /tasks endpoint against terminal-status jobs (return empty tasks instead of attempting NATS reserve)
- IncompleteJobFilter now excludes jobs by top-level status in addition to progress JSON stages
- Add stale worker cleanup to integration test script

Found during PSv2 integration testing where stale ADC workers with default DataLoader parallelism overwhelmed the single uvicorn worker thread by flooding /tasks with concurrent NATS reserve requests.

Co-Authored-By: Claude <noreply@anthropic.com>
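The terminal-status guard described in this commit can be sketched roughly as follows. This is not the PR's code: `Job`, `tasks_for_job`, and the `reserve` callable are hypothetical stand-ins, and the status set is assumed from the FAILURE/SUCCESS/REVOKED values named in the test plan.

```python
from dataclasses import dataclass

# Assumed terminal statuses (taken from the test plan); the real job model may define more.
TERMINAL_STATUSES = frozenset({"SUCCESS", "FAILURE", "REVOKED"})


@dataclass
class Job:
    """Hypothetical stand-in for the real job model."""
    id: int
    status: str


def tasks_for_job(job, reserve, batch=4):
    """Return reserved tasks, or an empty list without calling NATS if the job is finished.

    `reserve` is a hypothetical callable (job_id, batch) -> list that would
    perform the NATS reserve in the real endpoint.
    """
    if job.status in TERMINAL_STATUSES:
        return []  # short-circuit: no NATS round trip for finished jobs
    return reserve(job.id, batch)
```

The point of the early return is that a stale worker polling a finished job costs the endpoint only a status lookup, not a NATS request.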
Session notes from 2026-02-16 integration test including root cause analysis of stale worker task competition and NATS connection issues. Findings doc tracks applied fixes and remaining TODOs with priorities. Co-Authored-By: Claude <noreply@anthropic.com>
PSv2 integration test passed end-to-end (job 1380, 20/20 images). Identified ack_wait=300s as cause of ~5min idle time when GPU processes race for NATS tasks. Co-Authored-By: Claude <noreply@anthropic.com>
Replace N×1 reserve_task() calls with single reserve_tasks() batch fetch. The previous implementation created a new pull subscription per message (320 NATS round trips for batch=64), causing the /tasks endpoint to exceed HTTP client timeouts. The new approach uses one psub.fetch() call for the entire batch. Co-Authored-By: Claude <noreply@anthropic.com>
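A rough sketch of the batch approach described in this commit, assuming a pull subscription object whose `fetch(batch, timeout=...)` coroutine returns a list of messages and signals "nothing available" with a timeout error that `asyncio.TimeoutError` catches. `FakePullSubscription` is a hypothetical test double, not part of nats-py, and this `reserve_tasks` is an illustration rather than the PR's implementation.

```python
import asyncio


async def reserve_tasks(psub, batch, timeout=5.0):
    """Fetch up to `batch` messages with a single pull request.

    Returns an empty list if nothing arrives before the timeout.
    """
    try:
        return await psub.fetch(batch, timeout=timeout)
    except asyncio.TimeoutError:
        return []


class FakePullSubscription:
    """Hypothetical in-memory stand-in for a pull subscription, for illustration only."""

    def __init__(self, pending):
        self.pending = list(pending)
        self.fetch_calls = 0

    async def fetch(self, batch=1, timeout=5.0):
        self.fetch_calls += 1
        if not self.pending:
            raise asyncio.TimeoutError
        taken, self.pending = self.pending[:batch], self.pending[batch:]
        return taken
```

With the fake, `asyncio.run(reserve_tasks(FakePullSubscription(range(20)), 64))` returns all 20 items after one `fetch()` call, which is the property the commit relies on: the number of round trips no longer scales with the batch size.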
Summary
PSv2 works some of the time, but certain events cause no tasks to be returned, NATS to time out, or Django to become unresponsive.
Here are some findings from session 2 of 3. More to come!
Fixes discovered during PSv2 integration testing where stale ADC workers overwhelmed Django by flooding /tasks with concurrent NATS reserve requests.
- get_connection() now uses connect_timeout=5, allow_reconnect=False to prevent leaked reconnection loops from blocking Django's async event loop
- /tasks endpoint returns empty for terminal-status jobs instead of attempting NATS reserve
- IncompleteJobFilter excludes jobs by top-level status (not just progress JSON stages), so workers don't pick up manually-failed jobs
- AMI_NUM_WORKERS=0
Context
During integration testing, a stale ADC worker from a previous test run consumed all 20 NATS messages before the new worker could fetch them. The stale worker spawned multiple DataLoader subprocesses, generating 147 concurrent /tasks requests against Django's single uvicorn worker thread. Combined with NATS connections that defaulted to allow_reconnect=True, this blocked the entire event loop and made Django unresponsive.
TODO & TO-TRY
- _ensure_stream()/_ensure_consumer() operations
- async_to_sync() NATS calls with asyncio.wait_for() timeout
- --workers 4 to uvicorn dev config (optional? currently 1 is required for debugger)
- /tasks multi-pipeline support
- dispatch_mode on job init, not run()
Test plan
- scripts/psv2_integration_test.sh 20 with clean state (no stale workers)
- incomplete_only=1 excludes FAILURE/SUCCESS/REVOKED jobs
Full findings: docs/claude/planning/nats-flooding-prevention.md
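
The TODO item about giving NATS calls an asyncio.wait_for() timeout could look something like the sketch below. It is only an illustration under assumptions: `bounded` and `slow_reserve` are hypothetical names, and in the Django view the decorated coroutine would presumably still be invoked through async_to_sync().

```python
import asyncio
import functools


def bounded(seconds, fallback=list):
    """Decorator: give up on the wrapped coroutine after `seconds`.

    On timeout, asyncio.wait_for() cancels the inner coroutine and this
    wrapper returns `fallback()` instead of raising.
    """
    def decorate(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            try:
                return await asyncio.wait_for(fn(*args, **kwargs), timeout=seconds)
            except asyncio.TimeoutError:
                return fallback()
        return wrapper
    return decorate


@bounded(0.05)
async def slow_reserve():
    """Hypothetical NATS call that hangs longer than the allowed time."""
    await asyncio.sleep(1)
    return ["task"]


@bounded(0.5)
async def fast_reserve():
    """Hypothetical NATS call that completes in time."""
    await asyncio.sleep(0)
    return ["task"]
```

Returning an empty result on timeout mirrors the terminal-status behaviour of /tasks: the worker sees "no tasks" and retries later, rather than holding the single uvicorn worker on a hung connection.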