PSv2: Implement queue clean-up upon job completion#1113
mihow merged 7 commits into RolnickLab:main
Conversation
✅ Deploy Preview for antenna-ssec canceled.
✅ Deploy Preview for antenna-preview ready!
📝 Walkthrough

This pull request implements cleanup of async job resources (NATS streams/consumers and Redis keys) when ML jobs complete, fail, or are revoked. The core cleanup logic is refactored into a unified function, integrated into the task status handlers, and validated with tests covering all completion scenarios.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Job as Job Completion Event
    participant Tasks as tasks.py<br/>(Orchestration)
    participant Cleanup as jobs.py<br/>(cleanup_async_job_resources)
    participant Redis as TaskStateManager<br/>(Redis)
    participant NATS as TaskQueueManager<br/>(NATS)

    Note over Job,NATS: Completion triggered by: progress==100% OR failure OR revocation
    Job->>Tasks: _update_job_progress() / update_job_status() / update_job_failure()
    Tasks->>Tasks: Check if job_type=="ml" & async_pipeline_workers enabled
    alt Cleanup needed
        Tasks->>Cleanup: _cleanup_job_if_needed(job)
        Cleanup->>Redis: cleanup() - remove task state keys
        activate Redis
        Redis-->>Cleanup: redis_success (bool)
        deactivate Redis
        Cleanup->>NATS: TaskQueueManager context - delete streams/consumers
        activate NATS
        NATS-->>Cleanup: nats_success (bool)
        deactivate NATS
        Cleanup-->>Tasks: return redis_success AND nats_success
    else No cleanup needed
        Tasks->>Tasks: Skip cleanup
    end
```
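For orientation, here is a minimal, self-contained sketch of the unified cleanup flow shown above, written directly against redis-py and nats-py rather than the repository's managers; the Redis key pattern, the per-job stream/consumer names, and the server URLs are illustrative assumptions, not the actual code.

```python
import asyncio

import nats
import redis
from nats.js.errors import NotFoundError


async def cleanup_async_job_resources(job_id: int) -> bool:
    """Best-effort removal of a job's Redis state keys and NATS stream/consumer."""
    # Redis: drop the job's task-state keys (key pattern is an assumption).
    redis_success = True
    try:
        r = redis.Redis()
        keys = list(r.scan_iter(match=f"job:{job_id}:*"))
        if keys:
            r.delete(*keys)
    except Exception:
        redis_success = False  # the real code logs this via job.logger.exception

    # NATS: delete the per-job consumer and stream (names are assumptions).
    nats_success = True
    try:
        nc = await nats.connect("nats://localhost:4222")
        try:
            js = nc.jetstream()
            try:
                await js.delete_consumer(f"job-{job_id}", f"job-{job_id}-consumer")
            except NotFoundError:
                pass  # consumer already gone counts as cleaned
            try:
                await js.delete_stream(f"job-{job_id}")
            except NotFoundError:
                pass  # stream already gone counts as cleaned
        finally:
            await nc.close()
    except Exception:
        nats_success = False

    # As in the diagram, report success only if both backends were cleaned.
    return redis_success and nats_success


if __name__ == "__main__":
    print(asyncio.run(cleanup_async_job_resources(42)))
```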
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 4 passed
Pull request overview
This PR implements automatic cleanup of NATS JetStream and Redis resources when async ML jobs complete, fail, or are cancelled. This addresses issue #1083 by ensuring that temporary resources used for job orchestration are properly removed after jobs finish.
Changes:
- Renamed `cleanup_nats_resources` to `cleanup_async_job_resources` to handle both NATS and Redis cleanup
- Integrated cleanup into the job lifecycle at three points: completion, failure, and revocation (a guard sketch follows this list)
- Added comprehensive integration tests covering all three cleanup scenarios
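Concretely, that three-point integration can be pictured as a small guard in `ami/jobs/tasks.py`; only `cleanup_async_job_resources`, the `job_type == "ml"` condition, and the `async_pipeline_workers` flag come from this PR, while the accessor names are assumptions.

```python
from ami.ml.orchestration.jobs import cleanup_async_job_resources


def _cleanup_job_if_needed(job) -> None:
    """Called from the completion, failure, and revocation handlers alike."""
    if job.job_type != "ml":
        return  # only async ML jobs allocate NATS/Redis resources
    # Hypothetical accessor: the flag lives on the job's project in some form.
    if not job.project.feature_flags.get("async_pipeline_workers", False):
        return
    cleanup_async_job_resources(job)
```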
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| ami/ml/orchestration/jobs.py | Enhanced cleanup function to handle both Redis and NATS resources, with proper error handling and logging |
| ami/jobs/tasks.py | Integrated cleanup calls in job completion, failure, and revocation handlers with feature flag checks |
| ami/ml/orchestration/test_cleanup.py | Added comprehensive integration tests verifying cleanup works correctly in all three scenarios |
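A condensed sketch of how one of the three test scenarios (completion) could be shaped; the two fixture helpers are hypothetical stand-ins for the real setup in `ami/ml/orchestration/test_cleanup.py`.

```python
from unittest import mock

from django.test import TestCase


class CleanupOnCompletionTest(TestCase):
    def test_cleanup_called_when_job_completes(self):
        job = make_async_ml_job()  # hypothetical helper: ML job, flag enabled
        with mock.patch("ami.jobs.tasks.cleanup_async_job_resources") as cleanup:
            finish_all_stages(job)  # hypothetical helper: drives progress to 100%
        cleanup.assert_called_once_with(job)
```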
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@ami/ml/orchestration/test_cleanup.py`:
- Around line 118-159: In _verify_resources_cleaned, change the broad exception
handling inside the async check_nats_resources (which calls
manager.js.stream_info and manager.js.consumer_info via TaskQueueManager) to
only treat nats.js.errors.NotFoundError as "not found" (set
stream_exists/consumer_exists = False) and re-raise any other exceptions so
connection/infra errors fail the test; import or reference NotFoundError from
nats.js.errors and use it in the except clauses for the respective stream and
consumer checks.
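Applied to the helper, the narrowed handling looks roughly like this; `stream_info`, `consumer_info`, and `NotFoundError` are the real nats-py APIs named in the comment, while the function signature is an assumption.

```python
from nats.js.errors import NotFoundError


async def check_nats_resources(manager, stream: str, consumer: str):
    try:
        await manager.js.stream_info(stream)
        stream_exists = True
    except NotFoundError:
        stream_exists = False  # stream genuinely deleted

    try:
        await manager.js.consumer_info(stream, consumer)
        consumer_exists = True
    except NotFoundError:
        consumer_exists = False  # consumer genuinely deleted

    # Any other exception (connection or infra failure) now propagates,
    # so the test fails loudly instead of passing as "cleaned".
    return stream_exists, consumer_exists
```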
🧹 Nitpick comments (1)
ami/ml/orchestration/jobs.py (1)
33-53: Capture stack traces on cleanup failures for easier diagnosis.
`job.logger.error` drops the traceback; `job.logger.exception` preserves context without changing behavior.

🔧 Suggested update

```diff
-        except Exception as e:
-            job.logger.error(f"Error cleaning up Redis state for job {job.pk}: {e}")
+        except Exception:
+            job.logger.exception(f"Error cleaning up Redis state for job {job.pk}")
 ...
-        except Exception as e:
-            job.logger.error(f"Error cleaning up NATS resources for job {job.pk}: {e}")
+        except Exception:
+            job.logger.exception(f"Error cleaning up NATS resources for job {job.pk}")
```
* merge
* feat: PSv2 - Queue/redis clean-up upon job completion
* fix: catch specific exception
* chore: move tests to a subdir

Co-authored-by: Carlos Garcia Jurado Suarez <carlos@irreverentlabs.com>
Co-authored-by: Michael Bunsen <notbot@gmail.com>
* merge
* Update ML job counts in async case
* Update date picker version and tweak layout logic (#1105)
  * fix: update date picker version and tweak layout logic
  * feat: set start month based on selected date
* fix: Properly handle async job state with celery tasks (#1114)
  * merge
  * fix: Properly handle async job state with celery tasks
  * Apply suggestion from @Copilot
  * Delete implemented plan
* PSv2: Implement queue clean-up upon job completion (#1113)
  * merge
  * feat: PSv2 - Queue/redis clean-up upon job completion
  * fix: catch specific exception
  * chore: move tests to a subdir
* fix: PSv2: Workers should not try to fetch tasks from v1 jobs (#1118) (a model sketch follows this commit log)

  Introduces the dispatch_mode field on the Job model to track how each job dispatches its workload. This allows API clients (including the AMI worker) to filter jobs by dispatch mode — for example, fetching only async_api jobs so workers don't pull synchronous or internal jobs.

  JobDispatchMode enum (ami/jobs/models.py):
  * internal — work handled entirely within the platform (Celery worker, no external calls). Default for all jobs.
  * sync_api — worker calls an external processing service API synchronously and waits for each response.
  * async_api — worker publishes items to NATS for external processing service workers to pick up independently.

  Database and model changes:
  * Added dispatch_mode CharField with TextChoices, defaulting to internal, with the migration in ami/jobs/migrations/0019_job_dispatch_mode.py.
  * ML jobs set dispatch_mode = async_api when the project's async_pipeline_workers feature flag is enabled.
  * ML jobs set dispatch_mode = sync_api on the synchronous processing path (previously unset).

  API and filtering:
  * dispatch_mode is exposed (read-only) in job list and detail serializers.
  * Filterable via query parameter: ?dispatch_mode=async_api
  * The /tasks endpoint now returns 400 for non-async_api jobs, since only those have NATS tasks to fetch.

  Architecture doc: docs/claude/job-dispatch-modes.md documents the three modes, naming decisions, and per-job-type mapping.
* PSv2 cleanup: use is_complete() and dispatch_mode in job progress handler (#1125)
  * refactor: use is_complete() and dispatch_mode in job progress handler

    Replace the hardcoded `stage == "results"` check with `job.progress.is_complete()`, which verifies ALL stages are done, making it work for any job type. Replace the feature flag check in cleanup with `dispatch_mode == ASYNC_API`, which is immutable for the job's lifetime and more correct than re-reading a mutable flag that could change between job creation and completion.
  * test: update cleanup tests for is_complete() and dispatch_mode checks

    Set dispatch_mode=ASYNC_API on test jobs to match the new cleanup guard. Complete all stages (collect, process, results) in the completion test, since is_complete() correctly requires all stages to be done.
* track captures and failures
* Update tests, CR feedback, log error images
* CR feedback
* fix type checking
* refactor: rename _get_progress to _commit_update in TaskStateManager

  Clarify naming to distinguish mutating vs read-only methods:
  * _commit_update(): private, writes mutations to Redis, returns progress
  * get_progress(): public, read-only snapshot (added in #1129)
  * update_state(): public API, acquires lock, calls _commit_update()
* fix: unify FAILURE_THRESHOLD and convert TaskProgress to dataclass
  * Single FAILURE_THRESHOLD constant in tasks.py, imported by models.py
  * Fix async path to use `> FAILURE_THRESHOLD` (was `>=`) to match the sync path's boundary behavior at exactly 50%
  * Convert TaskProgress from namedtuple to dataclass with defaults, so new fields don't break existing callers
* refactor: rename TaskProgress to JobStateProgress

  Clarify that this dataclass tracks job-level progress in Redis, not individual task/image progress. Aligns with the naming of JobProgress (the Django/Pydantic model equivalent).
* docs: update NATS todo and planning docs with session learnings

  Mark connection handling as done (PR #1130), add worktree/remote mapping and docker testing notes for future sessions.
* Rename TaskStateManager to AsyncJobStateManager
* Track results counts in the job itself vs Redis
* small simplification
* Reset counts to 0 on reset
* chore: remove local planning docs from PR branch
* docs: clarify three-layer job state architecture in docstrings

  Explain the relationship between AsyncJobStateManager (Redis), JobProgress (JSONB), and JobState (enum). Clarify that all counts in JobStateProgress refer to source images (captures).

Co-authored-by: Carlos Garcia Jurado Suarez <carlos@irreverentlabs.com>
Co-authored-by: Anna Viklund <annamariaviklund@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Michael Bunsen <notbot@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
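For reference, the dispatch_mode addition described in the #1118 entry above would look roughly like the following; the display labels, max_length, and the stripped-down Job model are illustrative assumptions.

```python
from django.db import models


class JobDispatchMode(models.TextChoices):
    # Work handled entirely within the platform (Celery worker, no external calls).
    INTERNAL = "internal", "Internal"
    # Worker calls an external processing service API synchronously.
    SYNC_API = "sync_api", "Sync API"
    # Worker publishes items to NATS for external service workers to pick up.
    ASYNC_API = "async_api", "Async API"


class Job(models.Model):
    # Only the new field is shown; the real model has many more fields.
    dispatch_mode = models.CharField(
        max_length=20,  # assumed
        choices=JobDispatchMode.choices,
        default=JobDispatchMode.INTERNAL,
    )
```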

Summary
Performs clean-up of NATS and Redis resources used by async jobs
Related Issues
Closes #1083
Testing
Screenshots (omitted here):
* NATS dashboard with job running
* NATS dashboard after job finished
* Redis with job running
* Redis with job complete
* Job logs
Checklist
Summary by CodeRabbit
Bug Fixes
Tests