Implement class masking using the post-processing framework #999
mohamedelabbas1996 wants to merge 49 commits into main
Conversation
…en creating a terminal classification with the rolled up taxon
…gress, and algorithm binding
…f.logger and progress updates
…g and progress tracking
…ss-masking branch)
@mohamedelabbas1996 here is how I updated all of the determinations previously
…ocessing-class-masking
…lass-masking # Conflicts: # ami/ml/tests.py
✅ Deploy Preview for antenna-ssec canceled.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings or use CodeRabbit commands to manage reviews.
📝 Walkthrough

Adds two new post-processing tasks (class masking and rank rollup), a management command and admin actions/UI to run them, registry updates, removal of a package export, API serializer fields, templates, and expanded tests covering these workflows.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Admin
    participant Server as Django Admin
    participant Queue as PostProcessing Registry
    participant Task as RankRollupTask
    participant ClassDB as Classifications DB
    participant TaxonDB as Taxa DB
    participant DB as Database
    Admin->>Server: trigger run_rank_rollup(selected_collections)
    Server->>Queue: enqueue RankRollupTask with config
    Queue->>Task: run task for collection_id
    Task->>ClassDB: query terminal classifications for collection
    ClassDB-->>Task: return classifications + category_maps
    Task->>TaxonDB: resolve labels to Taxon objects & parent chains
    Task->>Task: aggregate scores by rank, apply thresholds
    alt rollup chosen
        Task->>DB: mark related classifications non-terminal
        Task->>DB: create new rolled-up terminal classification(s)
    end
    Task-->>Server: report updated occurrences
```

```mermaid
sequenceDiagram
    participant CLI as run_class_masking (management cmd)
    participant Task as ClassMaskingTask
    participant CollDB as SourceImageCollection
    participant ClassDB as Classifications
    participant AlgDB as Algorithm (category_map)
    participant TaxaDB as TaxaList
    participant DB as Database
    CLI->>CLI: parse args, validate collection/taxa_list/algorithm
    CLI->>Task: instantiate ClassMaskingTask (or dry-run)
    Task->>ClassDB: query terminal classifications for collection & algorithm
    ClassDB-->>Task: return classifications
    Task->>AlgDB: load category_map
    Task->>TaxaDB: load taxa list, compute filtered category_map
    Task->>Task: mask logits, recompute softmax, create new classifications
    Task->>DB: bulk save new classifications, mark old ones non-terminal
    Task-->>CLI: report completion / summary
```
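The masking step in the second diagram ("mask logits, recompute softmax") boils down to suppressing excluded classes in logit space and renormalizing. Here is a minimal numpy sketch of the idea; the function name and signature are illustrative, not from the PR:

```python
import numpy as np

def mask_and_renormalize(logits: list[float], excluded_indices: set[int]) -> list[float]:
    """Drive excluded class logits to -inf, then recompute the softmax.

    Masked classes receive exactly zero probability and the remaining
    scores are renormalized to sum to 1.
    """
    logits_np = np.asarray(logits, dtype=np.float64)
    logits_np[list(excluded_indices)] = -np.inf  # masked classes drop out entirely
    shifted = logits_np - logits_np.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return (exp / exp.sum()).tolist()

# Masking index 1 redistributes its probability mass over the remaining classes:
print(mask_and_renormalize([2.0, 1.0, 0.5], {1}))  # ~[0.818, 0.0, 0.182]
```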
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (warning).
- Add 4 class masking tests: score redistribution, accuracy improvement, no-op when all species in list, and softmax mathematical correctness
- Add `run_class_masking` management command with --collection-id, --taxa-list-id, --algorithm-id, and --dry-run options
- Register RankRollupTask in post-processing registry
- Remove non-existent `register_postprocessing_task` import/decorator from rank_rollup.py

Co-Authored-By: Claude <noreply@anthropic.com>
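As a usage sketch, the new command can be driven from a test or shell via Django's `call_command`; the ID values are placeholders:

```python
from django.core.management import call_command

# Each keyword mirrors one of the command options listed in the commit above.
call_command(
    "run_class_masking",
    collection_id=1,   # placeholder ID
    taxa_list_id=2,    # placeholder ID
    algorithm_id=3,    # placeholder ID
    dry_run=True,      # preview the changes without writing new classifications
)
```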
Actionable comments posted: 12
🧹 Nitpick comments (3)
ami/ml/post_processing/class_masking.py (2)
149-161: Variable shadowing: `logits` and `scores` are reassigned, hiding the originals from line 126.

`logits` is first bound from `classification.logits` (line 126), then rebound to `logits_np.tolist()` (line 151). Similarly, `scores` is rebound at line 161. This makes the code confusing; use distinct names like `masked_logits` and `masked_scores`.

Proposed fix:

```diff
- logits: list[float] = logits_np.tolist()
+ masked_logits: list[float] = logits_np.tolist()
  ...
- scores: list = scores_np.tolist()
+ masked_scores: list[float] = scores_np.tolist()
```

Then update all downstream references accordingly (lines 163, 166, 174, 186-188).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/class_masking.py` around lines 149 - 161, The code rebinds existing variables logits and scores (originally from classification.logits) which causes variable shadowing and confusion; rename the converted/filtered values (e.g., change logits_np.tolist() to masked_logits and scores_np.tolist() to masked_scores), keep logits_np and scores_np as-is, and update all downstream uses that currently reference logits or scores to use masked_logits and masked_scores (also check any other places using logits or scores after the masking block and adjust them to the new names).
48-75: `job` parameter is accepted but unused.

`update_occurrences_in_collection` accepts a `job` parameter (line 55) but never uses it. Either remove it or pass it along for progress tracking.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/class_masking.py` around lines 48 - 75, The function update_occurrences_in_collection currently accepts a job parameter but never uses it; either remove the unused parameter or propagate it for progress tracking—preferably pass job through to downstream processing so callers can monitor progress. Modify update_occurrences_in_collection (and any TODO block that creates the AlgorithmCategoryMap) to accept and forward job to make_classifications_filtered_by_taxa_list by adding job=job to that call and, if needed, update make_classifications_filtered_by_taxa_list's signature (and any other helper functions) to accept and use job for progress/log updates; alternatively, if you choose to remove it, delete the job parameter from update_occurrences_in_collection and update all callers accordingly.

ami/ml/post_processing/rank_rollup.py (1)
104-106: Remove commented-out debugger import.

Proposed fix:

```diff
  threshold = thresholds.get(rank, 1.0)
- # import pdb
-
- # pdb.set_trace()
  candidates = {t: s for t, s in taxon_scores.items() if t.rank == rank}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/rank_rollup.py` around lines 104 - 106, Remove the leftover commented debugger lines in the rank rollup module: delete the lines containing "# import pdb" and "# pdb.set_trace()" in ami/ml/post_processing/rank_rollup.py (look for the commented debugger import and trace calls near the post-processing/rank rollup logic) so no commented debugger statements remain; ensure no other pdb references are left in functions or methods within that module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ami/main/admin.py`:
- Around line 738-757: The DEFAULT_THRESHOLDS in the admin action
run_rank_rollup uses lowercase keys ("species","genus","family") but
RankRollupTask expects uppercase keys via RankRollupTask.ROLLUP_ORDER; update
DEFAULT_THRESHOLDS to use the same uppercase keys (e.g.,
"SPECIES","GENUS","FAMILY") or transform keys before creating the Job params so
they match RankRollupTask.ROLLUP_ORDER/lookup; ensure the symbol
DEFAULT_THRESHOLDS and the params payload in run_rank_rollup are consistent with
RankRollupTask.ROLLUP_ORDER used in rank_rollup.py.
- Around line 429-474: The admin action update_with_newfoundland_species
hardcodes TaxaList and Algorithm names and uses a bare except Exception; change
it to accept configurable taxa_list and algorithm (e.g., read from settings,
admin action parameters, or a constant at module top) instead of using the
literal "Newfoundland Species" and "Quebec & Vermont Species Classifier - Apr
2024", and add a TODO or link to your tracking issue to avoid shipping hardcoded
values; also replace the broad except in the loop with a narrower exception (or
catch Exception only to log full traceback via logging.exception and then
re-raise or continue) so programming errors aren't silently swallowed—update
references: update_with_newfoundland_species, TaxaList.objects.get,
Algorithm.objects.get, and update_single_occurrence accordingly.
In `@ami/ml/post_processing/class_masking.py`:
- Line 129: Move the numpy imports out of the loop and to module level: add
"import numpy as np", "from numpy import exp", and "from numpy import sum as
np_sum" at the top of the file, then delete the in-loop statements "import numpy
as np", "from numpy import exp", and "from numpy import sum" (the imports
currently inside the for loop). Ensure subsequent code uses np, exp, and np_sum
consistently after the change.
- Around line 170-171: Remove the two debug print statements that output "Top
taxon" and "Top index" in ami/ml/post_processing/class_masking.py; either delete
the lines that call print(...) or replace them with a proper logger.debug(...)
call (using the module/class logger) so the variables
category_map_with_taxa[top_index] and top_index are not printed to stdout in
production.
- Around line 206-208: Replace the root-logger call logging.info(...) with the
module-level logger call logger.info(...) where the message "Adding new
classification for Taxon {top_taxon} to occurrence
{new_classification.detection.occurrence}" is emitted; locate this call in the
function that creates new_classification and ensure the module-level logger
(logger defined at top of the module) is used so module-specific logging
configuration is respected.
- Around line 183-197: The new Classification is being created with
category_map=None which breaks methods like predictions() and top_n() because
Classification.save() (which auto-populates category_map from
algorithm.category_map) isn't called when later using bulk_create; update the
creation logic in class_masking.py so the new Classification's category_map is
explicitly set from new_algorithm.category_map (or otherwise ensure
Classification.save() runs before bulk_create) when constructing
new_classification (referencing Classification, new_classification,
new_algorithm, and Classification.save()) so downstream calls to
.predictions()/.top_n() have a valid category_map.
- Around line 84-112: The queryset taxa_in_list is iterated with "not in"
causing repeated DB hits and len(classifications) forces full evaluation; change
taxa_in_list = taxa_list.taxa.all() to a concrete set (e.g., a set of taxon PKs
or Taxon instances) and update the exclusion check in the comprehension that
builds excluded_category_map_with_taxa to test membership against that set (use
category["taxon"].pk or category["taxon"] depending on which identity you
store), and replace len(classifications) with classifications.count() to avoid
loading the whole QuerySet into memory; update any related variables like
category_map_with_taxa and excluded_category_indices to work with the chosen
identity (pk vs instance).
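(Aside: a minimal sketch of the set-membership change described in the prompt above, assuming a Django `TaxaList` with a `taxa` relation; names are illustrative.)

```python
def build_excluded_indices(taxa_list, category_map_with_taxa) -> list[int]:
    """One query for the allowed taxon PKs, then O(1) set lookups per category."""
    allowed_pks = set(taxa_list.taxa.values_list("pk", flat=True))  # single DB query
    return [
        idx
        for idx, category in enumerate(category_map_with_taxa)
        if category["taxon"].pk not in allowed_pks  # no per-item queryset scan
    ]
```

Pairing this with `classifications.count()` instead of `len(classifications)` keeps the queryset from being materialized.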
In `@ami/ml/post_processing/rank_rollup.py`:
- Around line 37-38: DEFAULT_THRESHOLDS and ROLLUP_ORDER use uppercase rank keys
while external callers pass lowercase, causing thresholds.get(rank, 1.0) to miss
matches; fix by normalizing incoming threshold keys to the same case as
ROLLUP_ORDER/DEFAULT_THRESHOLDS (e.g., convert provided thresholds to upper()
with something like thresholds = {k.upper(): v for k, v in thresholds.items()}
before any lookups/merging), and ensure any merging logic that references
DEFAULT_THRESHOLDS (and places where thresholds.get(rank, 1.0) is used) operates
on the normalized dict so rollup comparisons use the intended values.
- Around line 138-143: The code assumes clf.detection.occurrence is always
present and will raise when it's None; update the block around
updated_occurrences.append(occurrence) and the self.logger.info call to guard
against a missing occurrence: retrieve occurrence = clf.detection.occurrence,
then if occurrence is not None append it and log the rolled-up message using
occurrence.pk, otherwise handle the null case (e.g., skip appending and log a
different message referencing clf.detection.pk or clf.pk and the
new_taxon/new_score) so no AttributeError is raised when occurrence is None.
- Around line 70-71: The log formatting crashes when Classification.score is
None; update the logger.info call inside the loop (the for i, clf in
enumerate(qs.iterator(), start=1) block) to handle a nullable clf.score by
rendering a safe value—e.g., compute a short_score variable that is
f"{clf.score:.3f}" if clf.score is not None else "N/A" (or 0.0) and use that in
the message instead of directly formatting clf.score with : .3f so None never
raises TypeError.
- Around line 82-96: The per-label DB hit comes from
Taxon.objects.filter(name=label).first() inside the loop over
clf.scores/clf.category_map.labels; build a single mapping of labels to Taxon
before iterating classifications by collecting all labels (from
clf.category_map.labels across classifications), querying
Taxon.objects.filter(name__in=labels) once to produce a dict label_to_taxon,
then replace Taxon.objects.filter(...).first() with label_to_taxon.get(label)
and keep the existing handling (log skip when None) while leaving the rest of
the logic (rollup_order, find_ancestor_by_parent_chain, taxon_scores
accumulation, and debug logging) unchanged.
In `@ami/ml/tests.py`:
- Around line 973-1009: The thresholds dict keys are case-sensitive causing
lookups in RankRollupTask to miss lowercase keys and fall back to 1.0; fix this
by normalizing threshold keys or lookup to be case-insensitive — e.g., in the
RankRollupTask initializer or wherever you read thresholds (the thresholds
parameter and the code that uses thresholds.get(...)), convert keys to a
canonical form (upper-case) or use thresholds.get(taxon.rank.upper(),
default_threshold) when computing aggregate scores; after fixing, update the
test inputs if needed so genus/species scores exercise the intended non-1.0
thresholds.
---
Nitpick comments:
In `@ami/ml/post_processing/class_masking.py`:
- Around line 149-161: The code rebinds existing variables logits and scores
(originally from classification.logits) which causes variable shadowing and
confusion; rename the converted/filtered values (e.g., change logits_np.tolist()
to masked_logits and scores_np.tolist() to masked_scores), keep logits_np and
scores_np as-is, and update all downstream uses that currently reference logits
or scores to use masked_logits and masked_scores (also check any other places
using logits or scores after the masking block and adjust them to the new
names).
- Around line 48-75: The function update_occurrences_in_collection currently
accepts a job parameter but never uses it; either remove the unused parameter or
propagate it for progress tracking—preferably pass job through to downstream
processing so callers can monitor progress. Modify
update_occurrences_in_collection (and any TODO block that creates the
AlgorithmCategoryMap) to accept and forward job to
make_classifications_filtered_by_taxa_list by adding job=job to that call and,
if needed, update make_classifications_filtered_by_taxa_list's signature (and
any other helper functions) to accept and use job for progress/log updates;
alternatively, if you choose to remove it, delete the job parameter from
update_occurrences_in_collection and update all callers accordingly.
In `@ami/ml/post_processing/rank_rollup.py`:
- Around line 104-106: Remove the leftover commented debugger lines in the rank
rollup module: delete the lines containing "# import pdb" and "#
pdb.set_trace()" in ami/ml/post_processing/rank_rollup.py (look for the
commented debugger import and trace calls near the post-processing/rank rollup
logic) so no commented debugger statements remain; ensure no other pdb
references are left in functions or methods within that module.
- Move numpy imports to module level (class_masking.py)
- Replace print() with logger.debug()
- Use .count() instead of len() for QuerySet evaluation
- Convert taxa_in_list to set() for O(1) lookups
- Add logits__isnull=False to queryset filters
- Set category_map on new classifications (was None)
- Fix error message clarity and remove extra parens
- Use logger.info() instead of root logging.info()
- Wrap bulk_update/bulk_create in transaction.atomic()
- Normalize threshold keys to uppercase in RankRollupTask
- Guard clf.score formatting against None
- Guard against null occurrence in rank rollup
- Remove commented-out pdb lines
- Fix admin.py threshold keys to uppercase
- Fix test scores to exercise intended thresholds
- Clean up stale TODO comments and dead code

Co-Authored-By: Claude <noreply@anthropic.com>
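The "normalize threshold keys to uppercase" change can be sketched as follows; the default values match those used elsewhere in this PR, and the helper name is illustrative:

```python
DEFAULT_THRESHOLDS = {"SPECIES": 0.8, "GENUS": 0.6, "FAMILY": 0.4}

def normalize_thresholds(provided: dict[str, float] | None) -> dict[str, float]:
    """Upper-case caller-supplied keys, then overlay them on the defaults."""
    overrides = {k.upper(): v for k, v in (provided or {}).items()}
    return {**DEFAULT_THRESHOLDS, **overrides}

# normalize_thresholds({"species": 0.9}) -> {"SPECIES": 0.9, "GENUS": 0.6, "FAMILY": 0.4}
```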
Actionable comments posted: 5
🧹 Nitpick comments (5)
ami/ml/post_processing/class_masking.py (2)
51-75: `job` and `params` parameters are accepted but never used.

`job` is accepted (line 58) but never referenced for progress reporting, and `params` (line 55) is only logged. If progress tracking is intended, wire `job` into the loop. Otherwise, remove unused parameters to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/class_masking.py` around lines 51 - 75, The function update_occurrences_in_collection accepts job and params but never uses them; either remove these parameters or wire them up: if progress reporting is intended, pass the job into make_classifications_filtered_by_taxa_list (or use job.update_progress / task_logger within the loop inside make_classifications_filtered_by_taxa_list) and use params to control behavior (e.g., filter options or batch sizes) when building classifications; otherwise remove job and params from the signature and all call sites to avoid dead API surface. Ensure you update references to update_occurrences_in_collection and make_classifications_filtered_by_taxa_list so their signatures match and document the chosen behavior.
184-186: `assert` statements will be stripped under `python -O`; use explicit checks.

Lines 184-186 use `assert` for runtime validation of detection/occurrence presence. These are silently removed when Python runs with optimizations enabled. Since this guards data integrity before bulk operations, use explicit `if`/`raise` instead.

Proposed fix:

```diff
- assert new_classification.detection is not None
- assert new_classification.detection.occurrence is not None
- occurrences_to_update.add(new_classification.detection.occurrence)
+ if new_classification.detection is None:
+     raise ValueError(f"New classification for {classification.pk} has no detection.")
+ if new_classification.detection.occurrence is None:
+     raise ValueError(f"Detection {new_classification.detection.pk} has no occurrence.")
+ occurrences_to_update.add(new_classification.detection.occurrence)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/class_masking.py` around lines 184 - 186, Replace the runtime asserts that check new_classification.detection and new_classification.detection.occurrence with explicit checks that raise a clear exception (e.g., ValueError or RuntimeError) so they are not stripped under python -O; specifically, in the block where new_classification is processed (the code referencing new_classification.detection, new_classification.detection.occurrence and occurrences_to_update.add(...)), change the two assert lines to if-not checks that raise including contextual information (e.g., include new_classification id or index) before adding to occurrences_to_update.

ami/ml/tests.py (1)
1182-1187: Missing `logits__isnull=False` filter; inconsistent with the production code path.

`update_occurrences_in_collection` in `class_masking.py` (line 67) filters with `logits__isnull=False`, but this test queryset at lines 1182-1187 omits it. While the test data always has logits, adding the filter would keep the test closer to the production query and help catch regressions.

Proposed fix:

```diff
  classifications = Classification.objects.filter(
      detection__occurrence=occ,
      terminal=True,
      algorithm=self.algorithm,
      scores__isnull=False,
+     logits__isnull=False,
  )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ami/ml/tests.py` around lines 1182 - 1187, The test queryset using Classification.objects.filter(detection__occurrence=occ, terminal=True, algorithm=self.algorithm, scores__isnull=False) should mirror production by also filtering out null logits; update the filter call in the test to include logits__isnull=False so the query matches the one used in update_occurrences_in_collection in class_masking.py and will catch regressions related to missing logits.

ami/ml/post_processing/rank_rollup.py (2)
13-26: Parent-chain traversal triggers a DB query per hop.

Each call to `current.parent` is a lazy FK lookup: one query per ancestor level. When called inside the nested loop (line 95), this multiplies with the number of labels × ranks × classifications.

Consider prefetching or caching the parent chain. For example, if `Taxon` stores `parents_json` or a materialized path, use that instead. Otherwise, a single upfront query building a `taxon_id → parent` map would eliminate the repeated hits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/rank_rollup.py` around lines 13 - 26, The parent-chain traversal in find_ancestor_by_parent_chain causes a DB query per current.parent access; modify the implementation to avoid lazy FK hops by either using Taxon.parents_json or a materialized path if available, or change the function signature to accept a prebuilt mapping (e.g., taxon_id → parent_id/object) and traverse that map instead of current.parent; update callers (the nested loop that invokes find_ancestor_by_parent_chain) to pass the preloaded map or parents_json after performing a single upfront query that fetches all taxa and parent relationships, so ancestor lookups are O(depth) in-memory rather than multiple DB queries.
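A sketch of the single-upfront-query approach, assuming `Taxon` has a `parent` FK and a string `rank` field:

```python
def load_taxa_by_pk() -> dict:
    """One query loads every taxon; ancestor walks then stay in memory."""
    return {t.pk: t for t in Taxon.objects.all()}

def find_ancestor_in_memory(taxon, target_rank: str, taxa_by_pk: dict):
    """Walk the parent chain via the raw parent_id column (no lazy FK queries)."""
    current = taxon
    while current is not None:
        if current.rank.upper() == target_rank:
            return current
        current = taxa_by_pk.get(current.parent_id) if current.parent_id else None
    return None
```

For large taxonomies, restricting the initial query to the taxa actually referenced by the collection would keep memory bounded.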
70-152: Long-running transaction wraps the entire classification loop, including per-row DB writes.

The `transaction.atomic()` block spans the full iteration over potentially thousands of classifications, each performing an `UPDATE` (line 126) and a `CREATE` (line 127). This holds a database transaction open for the entire duration, which can cause lock contention and connection pool exhaustion under load.

Consider batching: accumulate the updates/creates and perform them in smaller atomic batches (similar to how `class_masking.py` uses `bulk_update`/`bulk_create` outside the loop).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/rank_rollup.py` around lines 70 - 152, The transaction.atomic() currently wraps the entire loop causing long transactions; change to process classifications in small batches: iterate the queryset without a global atomic, accumulate pending Classification updates (e.g., mark non-terminal) and new Classification instances (for create) into lists (reference the Classification.objects.filter(...).update(...) and Classification.objects.create(...) usages), and every N items (configurable batch_size) run a short transaction.atomic() that performs bulk_update or bulk_create for the accumulated items and clears the buffers; ensure you still append updated occurrences to updated_occurrences and call update_progress at the same batch boundary so behavior remains the same while avoiding one long transaction.
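The batching pattern the comment describes, assuming the `Classification` model is importable and the batch size is tunable:

```python
from django.db import transaction

BATCH_SIZE = 500  # illustrative; tune for the workload

def flush_batch(to_update: list, to_create: list) -> None:
    """Commit one small batch inside a short-lived transaction."""
    with transaction.atomic():
        if to_update:
            Classification.objects.bulk_update(to_update, ["terminal"])
        if to_create:
            Classification.objects.bulk_create(to_create)
    to_update.clear()
    to_create.clear()

# In the loop: append to the two lists, call flush_batch() every BATCH_SIZE
# items, then once more after the loop for the remainder.
```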
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ami/ml/post_processing/class_masking.py`:
- Around line 238-250: self.algorithm.category_map is only updated in memory
(assignment of algorithm.category_map to self.algorithm.category_map) so the
Algorithm DB row stays stale; after the in-memory assignment you should persist
the change by saving the Algorithm model (e.g., call
self.algorithm.save(update_fields=["category_map"])) or alternatively avoid
mutating self.algorithm and pass algorithm.category_map explicitly into
update_occurrences_in_collection/Classification creation; update the code around
the assignment and before calling update_occurrences_in_collection to ensure the
Algorithm DB record's category_map FK is saved.
- Around line 148-155: The equality check using classification.scores == scores
and classification.logits == logits is fragile due to floating-point
differences; replace it with a tolerance-based comparison (e.g., numpy.allclose
or an elementwise isclose with a small rtol/atol) when comparing
classification.scores to scores and classification.logits to logits inside the
block around top_index/top_taxon; alternatively or additionally, check whether
any indices that would be masked had non-negligible original logits (e.g.,
abs(logit) > tiny_threshold) before deciding to skip updating to avoid spurious
updates from rounding noise.
- Around line 109-113: The logger.info call is forcing materialization of the
classifications QuerySet by using len(classifications); replace that with the
previously computed classification_count (or call classifications.count()) so
the QuerySet is not evaluated into memory—update the f-string in the logging
statement where logger.info is called (referencing category_map_with_taxa,
excluded_category_map_with_taxa, and classifications/classification_count) to
use classification_count instead of len(classifications).
In `@ami/ml/post_processing/rank_rollup.py`:
- Around line 59-63: The query building variable qs uses
Classification.objects.filter(...) joining through
detection__source_image__collections which can produce duplicate Classification
rows; update the query in rank_rollup.py (the qs assignment using
Classification.objects.filter with terminal=True, taxon__isnull=False,
detection__source_image__collections__id=collection_id) to call .distinct()
after the filter to deduplicate results (match the pattern used by
update_occurrences_in_collection in class_masking.py).
In `@ami/ml/tests.py`:
- Around line 942-971: The category map is built from species_taxa using the
model's default ordering which can differ from the tests that explicitly use
.order_by("name") — update the query in _create_category_map_with_algorithm to
fetch species_taxa =
list(self.project.taxa.filter(rank=TaxonRank.SPECIES.name).order_by("name")[:3])
(and apply the same .order_by("name") change to any other similar query that
builds category maps or compares indices, e.g., the other occurrence around
lines 1025-1037) so the AlgorithmCategoryMap.data and labels indices align with
the test's expected species ordering; keep building AlgorithmCategoryMap and
Algorithm exactly as before.
---
Duplicate comments:
In `@ami/main/admin.py`:
- Around line 429-474: The admin action update_with_newfoundland_species
currently swallows all exceptions with a broad "except Exception as e" around
update_single_occurrence calls; when you generalize this action replace that
broad handler with a narrower one such as "except (ValueError, RuntimeError,
TaxaList.DoesNotExist) as e" and keep the existing self.message_user(request,
f'Error processing occurrence {occurrence.pk}: {str(e)}', level='error') pattern
(or add logging/traceback) so single failures still report but you avoid hiding
unexpected exceptions from update_single_occurrence and related code paths.
In `@ami/ml/post_processing/rank_rollup.py`:
- Around line 84-98: The loop issues an N+1 by calling
Taxon.objects.filter(name=label).first() for every label; instead, gather the
set of all labels used by clf.category_map.labels before iterating
classifications and fetch all matching Taxon rows once (e.g.,
Taxon.objects.filter(name__in=labels) and build a dict mapping name→Taxon), then
replace Taxon.objects.filter(name=label).first() with a lookup into that mapping
(handle missing names by logging and continuing). Update references in the
function containing clf.scores and clf.category_map.labels so taxon lookup uses
the preloaded dict.
---
Nitpick comments:
In `@ami/ml/post_processing/class_masking.py`:
- Around line 51-75: The function update_occurrences_in_collection accepts job
and params but never uses them; either remove these parameters or wire them up:
if progress reporting is intended, pass the job into
make_classifications_filtered_by_taxa_list (or use job.update_progress /
task_logger within the loop inside make_classifications_filtered_by_taxa_list)
and use params to control behavior (e.g., filter options or batch sizes) when
building classifications; otherwise remove job and params from the signature and
all call sites to avoid dead API surface. Ensure you update references to
update_occurrences_in_collection and make_classifications_filtered_by_taxa_list
so their signatures match and document the chosen behavior.
- Around line 184-186: Replace the runtime asserts that check
new_classification.detection and new_classification.detection.occurrence with
explicit checks that raise a clear exception (e.g., ValueError or RuntimeError)
so they are not stripped under python -O; specifically, in the block where
new_classification is processed (the code referencing
new_classification.detection, new_classification.detection.occurrence and
occurrences_to_update.add(...)), change the two assert lines to if-not checks
that raise including contextual information (e.g., include new_classification id
or index) before adding to occurrences_to_update.
In `@ami/ml/post_processing/rank_rollup.py`:
- Around line 13-26: The parent-chain traversal in find_ancestor_by_parent_chain
causes a DB query per current.parent access; modify the implementation to avoid
lazy FK hops by either using Taxon.parents_json or a materialized path if
available, or change the function signature to accept a prebuilt mapping (e.g.,
taxon_id → parent_id/object) and traverse that map instead of current.parent;
update callers (the nested loop that invokes find_ancestor_by_parent_chain) to
pass the preloaded map or parents_json after performing a single upfront query
that fetches all taxa and parent relationships, so ancestor lookups are O(depth)
in-memory rather than multiple DB queries.
- Around line 70-152: The transaction.atomic() currently wraps the entire loop
causing long transactions; change to process classifications in small batches:
iterate the queryset without a global atomic, accumulate pending Classification
updates (e.g., mark non-terminal) and new Classification instances (for create)
into lists (reference the Classification.objects.filter(...).update(...) and
Classification.objects.create(...) usages), and every N items (configurable
batch_size) run a short transaction.atomic() that performs bulk_update or
bulk_create for the accumulated items and clears the buffers; ensure you still
append updated occurrences to updated_occurrences and call update_progress at
the same batch boundary so behavior remains the same while avoiding one long
transaction.
In `@ami/ml/tests.py`:
- Around line 1182-1187: The test queryset using
Classification.objects.filter(detection__occurrence=occ, terminal=True,
algorithm=self.algorithm, scores__isnull=False) should mirror production by also
filtering out null logits; update the filter call in the test to include
logits__isnull=False so the query matches the one used in
update_occurrences_in_collection in class_masking.py and will catch regressions
related to missing logits.
```python
top_index = scores.index(max(scores))
top_taxon = category_map_with_taxa[top_index]["taxon"]
logger.debug(f"Top taxon: {category_map_with_taxa[top_index]}, index: {top_index}")

# check if needs updating
if classification.scores == scores and classification.logits == logits:
    logger.debug(f"Classification {classification.pk} does not need updating")
    continue
```
Float comparison for "no change" detection is fragile.
Line 153 compares classification.scores == scores after recomputing softmax from logits via numpy. Even when no masking occurs, floating-point rounding differences between the original scores (possibly computed on a different platform/precision) and the numpy recomputation can cause false mismatches, leading to spurious new classifications.
Consider comparing with a tolerance, or checking whether any excluded indices actually had non-negligible logit values before deciding to skip.
Proposed fix (tolerance-based approach)
```diff
+ import numpy as np_check
  # check if needs updating
- if classification.scores == scores and classification.logits == logits:
+ if (
+     np_check.allclose(classification.scores, scores, atol=1e-9)
+     and np_check.allclose(classification.logits, logits, atol=1e-9)
+ ):
      logger.debug(f"Classification {classification.pk} does not need updating")
      continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/ml/post_processing/class_masking.py` around lines 148 - 155, The equality
check using classification.scores == scores and classification.logits == logits
is fragile due to floating-point differences; replace it with a tolerance-based
comparison (e.g., numpy.allclose or an elementwise isclose with a small
rtol/atol) when comparing classification.scores to scores and
classification.logits to logits inside the block around top_index/top_taxon;
alternatively or additionally, check whether any indices that would be masked
had non-negligible original logits (e.g., abs(logit) > tiny_threshold) before
deciding to skip updating to avoid spurious updates from rounding noise.
Claude says: Fair point about floating-point precision. In practice the comparison works here because we're comparing against the same logits that were just recomputed through the same numpy path — so if no indices were actually masked (i.e., no excluded categories had non-negligible logits), the result is bitwise identical. A tolerance-based comparison with np.allclose would be more defensive. Noting as a follow-up improvement.
@mihow, that makes sense — when no categories are masked, the recomputation path is indeed identical and produces bitwise-equal results. Using np.allclose would guard against edge cases (e.g., if the original scores came from a different precision context or were stored/loaded differently), but if you're confident the current flow is safe, deferring it is reasonable.
Would you like me to open a follow-up issue to track the tolerance-based comparison improvement for future hardening?
```python
# @TODO temporary, do we need a new algorithm for each class mask?
self.algorithm.category_map = algorithm.category_map  # Ensure the algorithm has its category map loaded

update_occurrences_in_collection(
    collection=collection,
    taxa_list=taxa_list,
    algorithm=algorithm,
    params=self.config,
    task_logger=self.logger,
    job=job,
    new_algorithm=self.algorithm,
)
```
self.algorithm.category_map is set in memory but never saved to the database.
Line 240 assigns algorithm.category_map to self.algorithm.category_map in-memory. The downstream Classification objects created via bulk_create will correctly reference this category_map FK (since they read the in-memory value). However, the Algorithm record itself remains stale in the DB — any fresh query of this algorithm will show category_map=None.
Either call self.algorithm.save(update_fields=["category_map"]) after line 240, or refactor to pass the category_map explicitly rather than mutating the algorithm.
Proposed fix:

```diff
  self.algorithm.category_map = algorithm.category_map  # Ensure the algorithm has its category map loaded
+ self.algorithm.save(update_fields=["category_map"])
```
Verify each finding against the current code and only fix it if needed.
In `@ami/ml/post_processing/class_masking.py` around lines 238 - 250,
self.algorithm.category_map is only updated in memory (assignment of
algorithm.category_map to self.algorithm.category_map) so the Algorithm DB row
stays stale; after the in-memory assignment you should persist the change by
saving the Algorithm model (e.g., call
self.algorithm.save(update_fields=["category_map"])) or alternatively avoid
mutating self.algorithm and pass algorithm.category_map explicitly into
update_occurrences_in_collection/Classification creation; update the code around
the assignment and before calling update_occurrences_in_collection to ensure the
Algorithm DB record's category_map FK is saved.
Claude says: This is intentional for now — self.algorithm here is a dedicated post-processing algorithm (created by the task registry), and its category_map gets set from the source algorithm's category map for the duration of the task. Persisting it to DB would be the cleaner approach, but the @TODO on line 239 notes we're still deciding whether each class masking run should create its own algorithm or reuse one. Will address when we settle that design question.
@mihow, thanks for the clarification! That makes sense given the TODO on line 239 — if you're still evaluating whether each masking run should get its own algorithm or reuse one, it's reasonable to defer the persistence decision until that's settled. The in-memory assignment works fine for the task's duration, and the downstream Classification objects will correctly pick up the FK reference.
I'll leave this for you to address once the design question is resolved. Thanks for the context!
Must fix before merging. The class mask applications will likely NOT get a dynamic algorithm created.
Browser Testing: Class Masking & Admin Intermediate Page

1. Management Command Test (end-to-end)

Ran the `run_class_masking` management command end-to-end. Dry-run mode (`--dry-run`) was also exercised.

2. Admin Intermediate Page (replaces hardcoded action)

The hardcoded `update_with_newfoundland_species` action was replaced with an intermediate confirmation page.

The dropdowns are populated dynamically from the database, so any new taxa lists or algorithms are automatically available.

3. Classifications After Masking (admin view)

After running class masking, the Classification admin shows the new entries.
Changes in this update
Replace the hardcoded `update_with_newfoundland_species` admin action with a generic `run_class_masking` action that shows an intermediate confirmation page. Users can now select any TaxaList and Algorithm from dropdowns before running class masking on selected occurrences. Co-Authored-By: Claude <noreply@anthropic.com>
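For context, the intermediate-confirmation pattern for a Django admin action generally looks like the sketch below. The template path matches the one added in this PR; the form field names and the enqueue step are assumptions:

```python
from django.contrib import admin
from django.template.response import TemplateResponse

@admin.action(description="Run class masking on selected occurrences")
def run_class_masking(self, request, queryset):  # defined on the ModelAdmin
    if "apply" in request.POST:  # second submit, after the user confirmed
        taxa_list_id = request.POST["taxa_list_id"]  # assumed field name
        algorithm_id = request.POST["algorithm_id"]  # assumed field name
        # ... enqueue the class masking task for the selected occurrences ...
        return None  # returning None redirects back to the changelist
    context = {
        "occurrences": queryset,
        "taxa_lists": TaxaList.objects.all(),
        "algorithms": Algorithm.objects.filter(category_map__isnull=False),
    }
    return TemplateResponse(request, "admin/main/class_masking_confirmation.html", context)
```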
Browser Testing: Class Masking in the UI

Management Command

Ran class masking on real data.

Admin: Intermediate Page

Replaced the hardcoded action with the generic confirmation page.

Occurrence Detail — Identifications Tab

Tested affected occurrences (e.g., …).

What works well: …

Observations / potential improvements: …

Sample Results
Browser Testing: Identification Tab After Class Masking

Tested occurrence detail pages

Navigated to the occurrence detail page (Identifications tab) for occurrences affected by class masking.

Occurrence 43498 — determination changed from Campaea perlata to Protoboarmia porcelaria. The Identifications tab shows three predictions in order.

Occurrence 43501 — determination stayed the same (Platynota exasperatana).

Key observations

What works well: …

Issues / improvement ideas: …

Admin intermediate page

The hardcoded `update_with_newfoundland_species` action was replaced with the generic intermediate page.

Screenshots taken locally at …
Actionable comments posted: 1
🧹 Nitpick comments (3)
ami/main/admin.py (3)
763-782: `DEFAULT_THRESHOLDS` is a hidden magic constant buried inside the action.

The uppercase keys now match what `RankRollupTask` expects (the previous bug is fixed). However, defining `DEFAULT_THRESHOLDS` as a local variable makes the knobs invisible to operators who want to tune rollup behaviour without touching code. Promoting it to a module-level constant (or a class attribute of `SourceImageCollectionAdmin`) makes it immediately discoverable.

♻️ Suggested improvement:

```diff
+RANK_ROLLUP_DEFAULT_THRESHOLDS = {"SPECIES": 0.8, "GENUS": 0.6, "FAMILY": 0.4}
+
 ...
 @admin.action(description="Run Rank Rollup post-processing task (async)")
 def run_rank_rollup(self, request: HttpRequest, queryset: QuerySet[SourceImageCollection]) -> None:
     """Trigger the Rank Rollup post-processing job asynchronously."""
     jobs = []
-    DEFAULT_THRESHOLDS = {"SPECIES": 0.8, "GENUS": 0.6, "FAMILY": 0.4}
     for collection in queryset:
         job = Job.objects.create(
             ...
-            params={"task": "rank_rollup", "config": {"source_image_collection_id": collection.pk, "thresholds": DEFAULT_THRESHOLDS}},
+            params={"task": "rank_rollup", "config": {"source_image_collection_id": collection.pk, "thresholds": RANK_ROLLUP_DEFAULT_THRESHOLDS}},
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/main/admin.py` around lines 763 - 782, DEFAULT_THRESHOLDS is defined locally inside the run_rank_rollup admin action, making tuning invisible; move it to a discoverable location (either a module-level constant named DEFAULT_THRESHOLDS or as a class attribute on SourceImageCollectionAdmin) and update run_rank_rollup to reference that constant instead of the local variable so operators can adjust thresholds without editing the method; ensure the constant uses the same uppercase keys expected by RankRollupTask.
478-484: N+1 queries: missing `select_related("category_map")` on the algorithms queryset.

`Algorithm.objects.filter(category_map__isnull=False)` does not prefetch `category_map`. The loop on line 483 therefore issues one additional `SELECT` per algorithm to lazy-load its `AlgorithmCategoryMap`. Adding `select_related` collapses this to a single JOIN.

⚡ Proposed fix:

```diff
- algorithms = Algorithm.objects.filter(category_map__isnull=False).order_by("name")
+ algorithms = Algorithm.objects.filter(category_map__isnull=False).select_related("category_map").order_by("name")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/main/admin.py` around lines 478 - 484, The algorithms queryset is causing N+1 queries because Algorithm.objects.filter(category_map__isnull=False) does not fetch the related AlgorithmCategoryMap; update the queryset used when building alg_list to use select_related("category_map") so the category_map is joined in the same query, then keep the loop that assigns alg.labels_count = len(alg.category_map.labels) if alg.category_map else 0 (using the already-fetched relation) to avoid per-algorithm lazy loads; look for the variables/functions Algorithm, category_map, algorithms, alg_list and the loop assigning labels_count to implement this change.
503-506: Ruff RUF012: mutable class-level attributes should be annotated with `ClassVar`.

Both `inlines` and `actions` are plain mutable lists assigned as class attributes, which Ruff flags. The standard Django admin idiom uses plain lists, but the warning can be silenced cleanly with a `ClassVar` annotation.

🛠️ Proposed fix:

```diff
+from typing import ClassVar
 ...
 # Add classifications as inline
- inlines = [DetectionInline]
+ inlines: ClassVar[list] = [DetectionInline]
- actions = [run_class_masking]
+ actions: ClassVar[list] = [run_class_masking]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/main/admin.py` around lines 503 - 506, Ruff flags the mutable class-level attributes inlines and actions; update the admin class to annotate these attributes as ClassVar to silence RUF012: add "from typing import ClassVar, List" (or appropriate typing imports) and change the annotations for inlines (e.g., inlines: ClassVar[List[type[DetectionInline]]]) and actions (e.g., actions: ClassVar[List] or ClassVar[List[callable]]/List[str] as appropriate) while keeping the assigned values (inlines = [DetectionInline], actions = [run_class_masking]).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ami/templates/admin/main/class_masking_confirmation.html`:
- Around line 21-30: The markup currently nests a <ul> inside a <p> in
class_masking_confirmation.html which is invalid; fix by either replacing the
surrounding <p> with a block-level container (e.g., <div>) so the intro sentence
and the list are both within valid block content, or split into a standalone <p>
for the intro sentence followed by the <ul>; update the element(s) containing
"This will:" and the following list accordingly to preserve the same text and
semantics.
---
Duplicate comments:
In `@ami/main/admin.py`:
- Around line 466-471: The current except block in the admin handler uses a bare
"except Exception as e" which can swallow programming errors; update the
exception handling around the occurrence processing (the block that calls
self.message_user with occurrence.pk) to either catch only the specific expected
exceptions (e.g., ValueError, DatabaseError, etc.) or, if you intend a broad
catch, log the full traceback before reporting to the user (use
logging.exception or capture traceback.format_exc()) and then re-raise
unexpected exceptions as needed; ensure you still call
self.message_user(request, f"Error processing occurrence {occurrence.pk}: {e}",
level="error") after logging the detailed error.
---
Nitpick comments:
In `@ami/main/admin.py`:
- Around line 763-782: DEFAULT_THRESHOLDS is defined locally inside the
run_rank_rollup admin action, making tuning invisible; move it to a discoverable
location (either a module-level constant named DEFAULT_THRESHOLDS or as a class
attribute on SourceImageCollectionAdmin) and update run_rank_rollup to reference
that constant instead of the local variable so operators can adjust thresholds
without editing the method; ensure the constant uses the same uppercase keys
expected by RankRollupTask.
- Around line 478-484: The algorithms queryset is causing N+1 queries because
Algorithm.objects.filter(category_map__isnull=False) does not fetch the related
AlgorithmCategoryMap; update the queryset used when building alg_list to use
select_related("category_map") so the category_map is joined in the same query,
then keep the loop that assigns alg.labels_count = len(alg.category_map.labels)
if alg.category_map else 0 (using the already-fetched relation) to avoid
per-algorithm lazy loads; look for the variables/functions Algorithm,
category_map, algorithms, alg_list and the loop assigning labels_count to
implement this change.
- Around line 503-506: Ruff flags the mutable class-level attributes inlines and
actions; update the admin class to annotate these attributes as ClassVar to
silence RUF012: add "from typing import ClassVar, List" (or appropriate typing
imports) and change the annotations for inlines (e.g., inlines:
ClassVar[List[type[DetectionInline]]]) and actions (e.g., actions:
ClassVar[List] or ClassVar[List[callable]]/List[str] as appropriate) while
keeping the assigned values (inlines = [DetectionInline], actions =
[run_class_masking]).
Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from f9ca78f to fc3f9e1.
- Pre-load label→taxon mapping in RankRollupTask to eliminate N×M queries
- Add .distinct() to rank_rollup queryset (M2M join can produce dupes)
- Replace len(classifications) with pre-computed classification_count
- Fix invalid HTML: move <ul> outside <p> in confirmation template
- Add .order_by("name") to test helper for consistent category map indices
Co-Authored-By: Claude <noreply@anthropic.com>
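The label→taxon preload in the first bullet can be sketched as a single `IN` query; the `Taxon` import path and helper name are assumptions:

```python
def build_label_to_taxon(classifications) -> dict:
    """Resolve every category-map label to a Taxon row with one query."""
    labels: set[str] = set()
    for clf in classifications:
        labels.update(clf.category_map.labels)
    # One IN query replaces a filter(name=label).first() per label per classification.
    return {taxon.name: taxon for taxon in Taxon.objects.filter(name__in=labels)}

# In the rollup loop: taxon = label_to_taxon.get(label); log and skip when None.
```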
Add applied_to (FK to parent classification) to ClassificationSerializer, ClassificationListSerializer, and ClassificationNestedSerializer so the UI can show post-processing lineage (e.g., "derived from [Algorithm X]"). Co-Authored-By: Claude <noreply@anthropic.com>
Add ClassificationAppliedToSerializer so applied_to returns
{id, created_at, algorithm: {id, name, ...}} instead of just an integer.
Add select_related("applied_to__algorithm") to avoid N+1 queries.
Co-Authored-By: Claude <noreply@anthropic.com>
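The nested shape described above corresponds roughly to this DRF sketch; `AlgorithmSerializer` and the model import are assumptions about the existing codebase:

```python
from rest_framework import serializers

class ClassificationAppliedToSerializer(serializers.ModelSerializer):
    algorithm = AlgorithmSerializer(read_only=True)  # assumed existing serializer

    class Meta:
        model = Classification  # assumed model
        fields = ["id", "created_at", "algorithm"]

# On the parent serializers, applied_to becomes a nested read-only field, and the
# view queryset adds select_related("applied_to__algorithm") to avoid N+1 queries.
```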
Ensures nested classifications under detections eagerly load applied_to and its algorithm to avoid N+1 queries. Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 2
🧹 Nitpick comments (8)
ami/ml/post_processing/class_masking.py (3)
233-235: `logging.exception` already includes the exception message; the `{e}` interpolation is redundant.

`self.logger.exception(...)` automatically appends the current exception info. Interpolating `{e}` duplicates it.

♻️ Proposed fix:

```diff
- self.logger.exception(f"Failed to load objects: {e}")
+ self.logger.exception("Failed to load objects")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/class_masking.py` around lines 233 - 235, The log call in the exception handler redundantly interpolates the exception because self.logger.exception already logs the exception info; in the except block that currently does self.logger.exception(f"Failed to load objects: {e}") simply remove the "{e}" interpolation and call self.logger.exception("Failed to load objects") so the message and full traceback are logged once (update the except block where self.logger.exception is used for loading objects).
129-129: Downgrade per-iteration sum logs from INFO to DEBUG.

These execute once per classification (potentially thousands) and will flood logs at INFO level. Use `logger.debug` for both.

♻️ Proposed fix:

```diff
- logger.info(f"Previous totals: {sum(scores)} scores, {sum(logits)} logits")
+ logger.debug(f"Previous totals: {sum(scores)} scores, {sum(logits)} logits")
- logger.info(f"New totals: {sum(scores)} scores, {sum(logits)} logits")
+ logger.debug(f"New totals: {sum(scores)} scores, {sum(logits)} logits")
```

Also applies to: 145-145
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/class_masking.py` at line 129, Change the two per-iteration log calls that currently use logger.info to logger.debug so they don't flood INFO logs: replace the logger.info(f"Previous totals: {sum(scores)} scores, {sum(logits)} logits") call and the other similar logger.info call (at the second occurrence around line 145) with logger.debug using the same message formatting; keep the sums computation and formatting identical, only change the logging level.
51-58: `job` parameter is accepted but never used.

The `job` argument is passed at the call-site (line 248 of `run()`) but never referenced in `update_occurrences_in_collection`. Either forward it to `make_classifications_filtered_by_taxa_list` if needed for progress reporting, or remove it.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/class_masking.py` around lines 51 - 58, The job parameter on update_occurrences_in_collection is accepted but unused; either remove it entirely or forward it into the downstream call that likely needs it — update_occurrences_in_collection(collection, taxa_list, algorithm, params, new_algorithm, task_logger=logger, job=None) should pass job through to make_classifications_filtered_by_taxa_list (or whatever internal call performs progress reporting) so the progress reporter receives the same job object; locate the call to make_classifications_filtered_by_taxa_list inside update_occurrences_in_collection and add job as an argument (and update that callee signature if necessary), or if job is genuinely unused, remove the job parameter from update_occurrences_in_collection and from the run() call site to keep signatures consistent.

ami/ml/tests.py (2)
919-940: Rename unused loop variable `i` to `_i`.

Ruff B007: `i` is never referenced in the loop body.

♻️ Proposed fix:

```diff
- for i in range(num):
+ for _i in range(num):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/tests.py` around lines 919 - 940, The loop in _create_occurrences_with_classifications uses an unused variable named i which triggers Ruff B007; rename i to _i in the for loop header (for _i in range(num):) to indicate the variable is intentionally unused, leaving the loop body (Detection.objects.create, Occurrence.objects.create, Classification.objects.create and occurrences.append) unchanged.
1034-1034: Move `import math` to the top of the file.

`import math` appears inside four separate test methods (lines 1034, 1144, 1217, 1266). Python caches module imports, so there's no runtime cost, but it's non-idiomatic. A single `import math` at the top of the file is cleaner.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ami/ml/tests.py` at line 1034, Move the duplicate "import math" statements out of the individual test functions and add a single "import math" at the top-level imports of the test module; remove the four in-function imports so all tests use the top-level math import (search for the in-function "import math" occurrences to update them). Ensure no other name conflicts occur and run tests to confirm no behavioral changes.

ami/ml/post_processing/rank_rollup.py (3)
37-38: Mutable class-level defaults should be annotated with `ClassVar`.

♻️ Proposed fix:

```diff
+ from typing import ClassVar
+
- DEFAULT_THRESHOLDS = {"SPECIES": 0.8, "GENUS": 0.6, "FAMILY": 0.4}
- ROLLUP_ORDER = ["SPECIES", "GENUS", "FAMILY"]
+ DEFAULT_THRESHOLDS: ClassVar[dict[str, float]] = {"SPECIES": 0.8, "GENUS": 0.6, "FAMILY": 0.4}
+ ROLLUP_ORDER: ClassVar[list[str]] = ["SPECIES", "GENUS", "FAMILY"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/rank_rollup.py` around lines 37 - 38, DEFAULT_THRESHOLDS and ROLLUP_ORDER are mutable class-level defaults and should be declared as ClassVar to avoid accidental per-instance mutation; update the class that defines DEFAULT_THRESHOLDS and ROLLUP_ORDER to annotate them like DEFAULT_THRESHOLDS: ClassVar[Dict[str, float]] and ROLLUP_ORDER: ClassVar[List[str]] (or the appropriate typing aliases), add the necessary "from typing import ClassVar, Dict, List" import if missing, and keep the existing constant values unchanged.
116-118: `t.rank == rank` should normalize case to match `find_ancestor_by_parent_chain`.

`find_ancestor_by_parent_chain` defensively calls `.upper()` on `current.rank` before comparing (line 22). The candidates filter uses a direct equality check, creating a silent failure if any `Taxon.rank` value is ever lowercase: rollup candidates would always be empty.

♻️ Proposed fix:

```diff
- candidates = {t: s for t, s in taxon_scores.items() if t.rank == rank}
+ candidates = {t: s for t, s in taxon_scores.items() if t.rank.upper() == rank}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/rank_rollup.py` around lines 116 - 118, The candidates filter silently fails on case mismatch because it compares Taxon.rank directly; update the comprehension that builds candidates (using rollup_order, thresholds, taxon_scores) to normalize case before comparing (e.g., compare t.rank.upper() to rank or normalize both sides) so it matches the defensive behavior used in find_ancestor_by_parent_chain; ensure you only change the equality check in the comprehension that constructs candidates = {t: s for t, s in taxon_scores.items() if ... } to use the normalized rank comparison.
83-157: Multiple N+1 FK accesses inside `qs.iterator()`; `select_related` is not effective here.

`clf.category_map` (lines 90, 97), `clf.detection` (lines 138-140), and `clf.detection.occurrence` (line 149) each trigger a separate DB query per iteration when using `iterator()`, since `select_related()` is not compatible with `iterator()` in older Django.

Suggested mitigations:

- Category map: Build a `{pk: AlgorithmCategoryMap}` cache from the already-computed `cat_map_ids` (see previous comment) and look up by `clf.category_map_id`.
- Detection FK: Use `clf.detection_id` (the raw FK column, no DB hit) in the `filter(detection_id=...)` and `create(detection_id=...)` calls.
- Occurrence: Pre-fetch occurrence PKs or consolidate occurrence updates after the loop.

♻️ Illustrative patch for category_map cache and detection_id usage:

```diff
+ category_map_cache = {cm.pk: cm for cm in AlgorithmCategoryMap.objects.filter(pk__in=cat_map_ids)}
  with transaction.atomic():
      for i, clf in enumerate(qs.iterator(), start=1):
          ...
-         if not clf.category_map:
+         cm = category_map_cache.get(clf.category_map_id)
+         if not cm:
              self.logger.info(f"Skipping classification #{clf.pk}: no category_map assigned")
              continue
          ...
          for idx, score in enumerate(clf.scores):
-             label = clf.category_map.labels[idx]
+             label = cm.labels[idx]
          ...
-     Classification.objects.filter(detection=clf.detection).update(terminal=False)
+     Classification.objects.filter(detection_id=clf.detection_id).update(terminal=False)
      Classification.objects.create(
-         detection=clf.detection,
+         detection_id=clf.detection_id,
          ...
      )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/ml/post_processing/rank_rollup.py` around lines 83 - 157, The loop over qs.iterator() causes N+1 DB hits for clf.category_map, clf.detection and clf.detection.occurrence because select_related() is ineffective with iterator(); fix by 1) building an in-memory cache of AlgorithmCategoryMap objects keyed by pk from the precomputed cat_map_ids and replace accesses to clf.category_map with a lookup using clf.category_map_id, 2) avoid resolving the Detection FK by using clf.detection_id in DB calls (replace Classification.objects.filter(detection=clf.detection).update(...) with filter(detection_id=clf.detection_id) and pass detection_id=clf.detection_id to Classification.objects.create), and 3) stop accessing clf.detection.occurrence inside the loop — instead collect detection_ids (or classification ids) during iteration and after the loop bulk-fetch occurrences (or occurrence pks) and append to updated_occurrences or perform consolidated updates to avoid per-row queries.
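For the third mitigation (consolidating occurrence updates after the loop), here is a rough sketch under the same assumptions — the loop and `qs` come from the patch above, while the `Detection`/`Occurrence` lookups, the import path, and the `updated_occurrences` name are illustrative, not the task's actual code:

```python
from ami.main.models import Detection, Occurrence  # import path assumed

# Inside the loop: record raw FK values instead of touching clf.detection.occurrence
detection_ids: set[int] = set()
for clf in qs.iterator():
    # ... masking / rollup work for this classification ...
    detection_ids.add(clf.detection_id)  # no DB hit: reads the FK column directly

# After the loop: resolve all affected occurrences in two queries total
occurrence_ids = (
    Detection.objects.filter(pk__in=detection_ids)
    .values_list("occurrence_id", flat=True)
    .distinct()
)
updated_occurrences = list(Occurrence.objects.filter(pk__in=occurrence_ids))
```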
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ami/ml/post_processing/class_masking.py`:
- Around line 162-174: The new Classification created in class_masking.py is
missing the applied_to field, losing post-processing lineage; update the
new_classification instantiation to include applied_to referencing the original
classification (e.g., applied_to=classification) similar to how RankRollupTask
sets applied_to=clf so the created classification preserves its source lineage.
In `@ami/ml/post_processing/rank_rollup.py`:
- Around line 68-78: qs.only("category_map") still triggers N+1 because
accessing clf.category_map performs a deferred fetch per row; instead, collect
distinct category_map_id from the queryset (use
qs.values_list("category_map_id", flat=True).distinct()), fetch all
AlgorithmCategoryMap objects for those ids in one query, then iterate over the
pre-fetched maps to gather labels (avoid accessing clf.category_map directly).
Populate all_labels from the cached AlgorithmCategoryMap instances, build
label_to_taxon as before, and log using the same self.logger.info call;
reference qs.only("category_map"), clf.category_map, and AlgorithmCategoryMap to
locate and replace the logic.
---
Nitpick comments:
In `@ami/ml/post_processing/class_masking.py`:
- Around line 233-235: The log call in the exception handler redundantly
interpolates the exception because self.logger.exception already logs the
exception info; in the except block that currently does
self.logger.exception(f"Failed to load objects: {e}") simply remove the "{e}"
interpolation and call self.logger.exception("Failed to load objects") so the
message and full traceback are logged once (update the except block where
self.logger.exception is used for loading objects).
- Line 129: Change the two per-iteration log calls that currently use
logger.info to logger.debug so they don't flood INFO logs: replace the
logger.info(f"Previous totals: {sum(scores)} scores, {sum(logits)} logits") call
and the other similar logger.info call (at the second occurrence around line
145) with logger.debug using the same message formatting; keep the sums
computation and formatting identical, only change the logging level.
- Around line 51-58: The job parameter on update_occurrences_in_collection is
accepted but unused; either remove it entirely or forward it into the downstream
call that likely needs it — update_occurrences_in_collection(collection,
taxa_list, algorithm, params, new_algorithm, task_logger=logger, job=None)
should pass job through to make_classifications_filtered_by_taxa_list (or
whatever internal call performs progress reporting) so the progress reporter
receives the same job object; locate the call to
make_classifications_filtered_by_taxa_list inside
update_occurrences_in_collection and add job as an argument (and update that
callee signature if necessary), or if job is genuinely unused, remove the job
parameter from update_occurrences_in_collection and from the run() call site to
keep signatures consistent.
In `@ami/ml/post_processing/rank_rollup.py`:
- Around line 37-38: DEFAULT_THRESHOLDS and ROLLUP_ORDER are mutable class-level
defaults and should be declared as ClassVar to avoid accidental per-instance
mutation; update the class that defines DEFAULT_THRESHOLDS and ROLLUP_ORDER to
annotate them like DEFAULT_THRESHOLDS: ClassVar[Dict[str, float]] and
ROLLUP_ORDER: ClassVar[List[str]] (or the appropriate typing aliases), add the
necessary "from typing import ClassVar, Dict, List" import if missing, and keep
the existing constant values unchanged.
- Around line 116-118: The candidates filter silently fails on case mismatch
because it compares Taxon.rank directly; update the comprehension that builds
candidates (using rollup_order, thresholds, taxon_scores) to normalize case
before comparing (e.g., compare t.rank.upper() to rank or normalize both sides)
so it matches the defensive behavior used in find_ancestor_by_parent_chain;
ensure you only change the equality check in the comprehension that constructs
candidates = {t: s for t, s in taxon_scores.items() if ... } to use the
normalized rank comparison.
- Around line 83-157: The loop over qs.iterator() causes N+1 DB hits for
clf.category_map, clf.detection and clf.detection.occurrence because
select_related() is ineffective with iterator(); fix by 1) building an in-memory
cache of AlgorithmCategoryMap objects keyed by pk from the precomputed
cat_map_ids and replace accesses to clf.category_map with a lookup using
clf.category_map_id, 2) avoid resolving the Detection FK by using
clf.detection_id in DB calls (replace
Classification.objects.filter(detection=clf.detection).update(...) with
filter(detection_id=clf.detection_id) and pass detection_id=clf.detection_id to
Classification.objects.create), and 3) stop accessing clf.detection.occurrence
inside the loop — instead collect detection_ids (or classification ids) during
iteration and after the loop bulk-fetch occurrences (or occurrence pks) and
append to updated_occurrences or perform consolidated updates to avoid per-row
queries.
In `@ami/ml/tests.py`:
- Around line 919-940: The loop in _create_occurrences_with_classifications uses
an unused variable named i which triggers Ruff B007; rename i to _i in the for
loop header (for _i in range(num):) to indicate the variable is intentionally
unused, leaving the loop body (Detection.objects.create,
Occurrence.objects.create, Classification.objects.create and occurrences.append)
unchanged.
- Line 1034: Move the duplicate "import math" statements out of the individual
test functions and add a single "import math" at the top-level imports of the
test module; remove the four in-function imports so all tests use the top-level
math import (search for the in-function "import math" occurrences to update
them). Ensure no other name conflicts occur and run tests to confirm no
behavioral changes.
```python
new_classification = Classification(
    taxon=top_taxon,
    algorithm=new_algorithm,
    score=max(scores),
    scores=scores,
    logits=logits,
    detection=classification.detection,
    timestamp=classification.timestamp,
    terminal=True,
    category_map=new_algorithm.category_map,
    created_at=timestamp,
    updated_at=timestamp,
)
```
`applied_to` is not set on the new classification — post-processing lineage is lost.

`RankRollupTask` correctly sets `applied_to=clf` (line 146 of rank_rollup.py). Class masking omits it, so every class-masked classification will have `applied_to=null`, defeating the lineage field that the PR explicitly exposes in the serializers.
🐛 Proposed fix
```diff
 new_classification = Classification(
     taxon=top_taxon,
     algorithm=new_algorithm,
     score=max(scores),
     scores=scores,
     logits=logits,
     detection=classification.detection,
     timestamp=classification.timestamp,
     terminal=True,
     category_map=new_algorithm.category_map,
     created_at=timestamp,
     updated_at=timestamp,
+    applied_to=classification,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/ml/post_processing/class_masking.py` around lines 162 - 174, The new
Classification created in class_masking.py is missing the applied_to field,
losing post-processing lineage; update the new_classification instantiation to
include applied_to referencing the original classification (e.g.,
applied_to=classification) similar to how RankRollupTask sets applied_to=clf so
the created classification preserves its source lineage.
Fix: Add missing `applied_to` field to class masking

- Update `ami/ml/post_processing/class_masking.py` line ~172 to include `applied_to=classification` in the `new_classification` instantiation, preserving post-processing lineage as intended by the PR.
💡 For code changes across multiple files, use @coderabbitai plan in an issue comment
```python
# Pre-load all labels from category maps to avoid N+1 queries
all_labels: set[str] = set()
for clf in qs.only("category_map"):
    if clf.category_map and clf.category_map.labels:
        all_labels.update(label for label in clf.category_map.labels if label)

label_to_taxon = {}
if all_labels:
    for taxon in Taxon.objects.filter(name__in=all_labels).select_related("parent"):
        label_to_taxon[taxon.name] = taxon
self.logger.info(f"Pre-loaded {len(label_to_taxon)} taxa from {len(all_labels)} unique labels")
```
`qs.only("category_map")` still causes N+1 — each `clf.category_map` access queries the DB per instance.

`only()` loads `category_map_id` (the FK column), but deferred access to `clf.category_map` fetches `AlgorithmCategoryMap` in a separate per-row query. With 470 classifications all sharing the same `category_map_id`, that's 470 identical queries. Query the category maps directly instead:
🐛 Proposed fix

```diff
-# Pre-load all labels from category maps to avoid N+1 queries
-all_labels: set[str] = set()
-for clf in qs.only("category_map"):
-    if clf.category_map and clf.category_map.labels:
-        all_labels.update(label for label in clf.category_map.labels if label)
+# Pre-load all labels from category maps — query the maps directly (2 queries total)
+from ami.ml.models.algorithm import AlgorithmCategoryMap
+cat_map_ids = qs.values_list("category_map_id", flat=True).distinct()
+all_labels: set[str] = set()
+for cm in AlgorithmCategoryMap.objects.filter(pk__in=cat_map_ids):
+    if cm.labels:
+        all_labels.update(label for label in cm.labels if label)
```
+ all_labels.update(label for label in cm.labels if label)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Pre-load all labels from category maps — query the maps directly (2 queries total)
from ami.ml.models.algorithm import AlgorithmCategoryMap

cat_map_ids = qs.values_list("category_map_id", flat=True).distinct()
all_labels: set[str] = set()
for cm in AlgorithmCategoryMap.objects.filter(pk__in=cat_map_ids):
    if cm.labels:
        all_labels.update(label for label in cm.labels if label)

label_to_taxon = {}
if all_labels:
    for taxon in Taxon.objects.filter(name__in=all_labels).select_related("parent"):
        label_to_taxon[taxon.name] = taxon
self.logger.info(f"Pre-loaded {len(label_to_taxon)} taxa from {len(all_labels)} unique labels")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/ml/post_processing/rank_rollup.py` around lines 68 - 78,
qs.only("category_map") still triggers N+1 because accessing clf.category_map
performs a deferred fetch per row; instead, collect distinct category_map_id
from the queryset (use qs.values_list("category_map_id", flat=True).distinct()),
fetch all AlgorithmCategoryMap objects for those ids in one query, then iterate
over the pre-fetched maps to gather labels (avoid accessing clf.category_map
directly). Populate all_labels from the cached AlgorithmCategoryMap instances,
build label_to_taxon as before, and log using the same self.logger.info call;
reference qs.only("category_map"), clf.category_map, and AlgorithmCategoryMap to
locate and replace the logic.
Actionable comments posted: 2
🧹 Nitpick comments (1)
ami/main/api/serializers.py (1)
877-884: `ClassificationAppliedToSerializer` embeds the full `AlgorithmSerializer` despite being described as "lightweight"

`AlgorithmSerializer` exposes 13+ fields including nested `category_map` (itself a nested serializer), `uri`, `version_name`, `task_type`, `description`, etc. For the stated purpose of showing "Derived from [algorithm]" lineage, a minimal inline would be more appropriate and avoids pulling unnecessary data into every nested classification response.

♻️ Suggested minimal algorithm representation

```diff
+class MinimalAlgorithmSerializer(serializers.ModelSerializer):
+    class Meta:
+        model = Algorithm
+        fields = ["id", "name", "key", "version"]
+
 class ClassificationAppliedToSerializer(serializers.ModelSerializer):
     """Lightweight nested representation of the parent classification this was derived from."""

-    algorithm = AlgorithmSerializer(read_only=True)
+    algorithm = MinimalAlgorithmSerializer(read_only=True)

     class Meta:
         model = Classification
         fields = ["id", "created_at", "algorithm"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/main/api/serializers.py` around lines 877 - 884, ClassificationAppliedToSerializer currently embeds the full AlgorithmSerializer (too heavy); create a lightweight nested serializer (e.g., MinimalAlgorithmSerializer) with only the minimal fields needed for lineage (such as id and a human label like name or version_name and optionally uri) and replace algorithm = AlgorithmSerializer(read_only=True) with algorithm = MinimalAlgorithmSerializer(read_only=True) inside the same module; ensure the new MinimalAlgorithmSerializer references the Algorithm model and is listed before ClassificationAppliedToSerializer so imports/resolution succeed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ami/main/api/serializers.py`:
- Around line 877-884: This is a RUF012 false positive for the mutable default
class attribute on the Meta.fields list in ClassificationAppliedToSerializer; if
CI enforces RUF012, suppress it by adding a local noqa comment on the
Meta.fields line (i.e., add "# noqa: RUF012" to the fields = ["id",
"created_at", "algorithm"] line in the ClassificationAppliedToSerializer.Meta
block) so the linter ignores this diff-scoped artifact.
In `@ami/main/api/views.py`:
- Around line 1228-1238: OccurrenceListSerializer.get_determination_details() is
triggering N+1 queries by accessing Detection.best_prediction (a cached_property
that calls Detection.predictions().order_by(...).first()) without eager-loading
applied_to and applied_to.algorithm; fix by ensuring the queryset used when
building detections for list view (or inside Detection.predictions()) prefetches
applied_to__algorithm for Classification objects, or alternatively adjust
OccurrenceListSerializer to avoid serializing applied_to in list action (e.g.,
use a lightweight nested serializer for list views); target symbols:
OccurrenceListSerializer.get_determination_details, Detection.best_prediction,
Detection.predictions, and ClassificationNestedSerializer (and the
applied_to__algorithm prefetch) so the top prediction’s applied_to and its
algorithm are loaded eagerly.
---
Nitpick comments:
In `@ami/main/api/serializers.py`:
- Around line 877-884: ClassificationAppliedToSerializer currently embeds the
full AlgorithmSerializer (too heavy); create a lightweight nested serializer
(e.g., MinimalAlgorithmSerializer) with only the minimal fields needed for
lineage (such as id and a human label like name or version_name and optionally
uri) and replace algorithm = AlgorithmSerializer(read_only=True) with algorithm
= MinimalAlgorithmSerializer(read_only=True) inside the same module; ensure the
new MinimalAlgorithmSerializer references the Algorithm model and is listed
before ClassificationAppliedToSerializer so imports/resolution succeed.
```python
class ClassificationAppliedToSerializer(serializers.ModelSerializer):
    """Lightweight nested representation of the parent classification this was derived from."""

    algorithm = AlgorithmSerializer(read_only=True)

    class Meta:
        model = Classification
        fields = ["id", "created_at", "algorithm"]
```
Ruff RUF012 false positive — no change needed, but suppress if CI enforces it

The `fields = [...]` list on line 884 triggers RUF012 ("mutable default class attribute"), but the identical pattern is used in every `Meta` class throughout the file without being flagged. This is a diff-scoped linter artifact, not a real risk. If CI is strict about RUF012, the quickest suppression is a `# noqa: RUF012` comment on the fields line — otherwise leave it.
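If suppression is chosen, the change is a single trailing comment on the flagged line (a sketch of the suggested suppression, not a committed change):

```python
class Meta:
    model = Classification
    fields = ["id", "created_at", "algorithm"]  # noqa: RUF012
```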
🧰 Tools

🪛 Ruff (0.15.1)

[warning] 884-884: Mutable default value for class attribute (RUF012)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/main/api/serializers.py` around lines 877 - 884, This is a RUF012 false
positive for the mutable default class attribute on the Meta.fields list in
ClassificationAppliedToSerializer; if CI enforces RUF012, suppress it by adding
a local noqa comment on the Meta.fields line (i.e., add "# noqa: RUF012" to the
fields = ["id", "created_at", "algorithm"] line in the
ClassificationAppliedToSerializer.Meta block) so the linter ignores this
diff-scoped artifact.
| "detections", | ||
| queryset=Detection.objects.order_by("-timestamp") | ||
| .select_related("source_image") | ||
| .prefetch_related( | ||
| Prefetch( | ||
| "classifications", | ||
| queryset=Classification.objects.select_related( | ||
| "taxon", "algorithm", "applied_to__algorithm" | ||
| ), | ||
| ) | ||
| ), |
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Find the best_prediction definition on the Occurrence model/manager
rg -n "best_prediction" --type py -A 5 -B 2
```

Repository: RolnickLab/antenna
Length of output: 4132

🏁 Script executed:

```bash
#!/bin/bash
# Get the full best_prediction property implementation
sed -n '2912,2930p' ami/main/models.py
```

Repository: RolnickLab/antenna
Length of output: 945

🏁 Script executed:

```bash
#!/bin/bash
# Find ClassificationNestedSerializer definition and fields
rg -n "class ClassificationNestedSerializer" --type py -A 15
```

Repository: RolnickLab/antenna
Length of output: 1020

🏁 Script executed:

```bash
#!/bin/bash
# Check the exact serializer method mentioned in the review
sed -n '1331,1358p' ami/main/api/serializers.py
```

Repository: RolnickLab/antenna
Length of output: 1288

🏁 Script executed:

```bash
#!/bin/bash
# Verify the prefetch setup mentioned in lines 1228-1238 of views.py
sed -n '1228,1238p' ami/main/api/views.py
```

Repository: RolnickLab/antenna
Length of output: 571

🏁 Script executed:

```bash
#!/bin/bash
# Find the predictions() method on Occurrence model
rg -n "def predictions" ami/main/models.py -A 10
```

Repository: RolnickLab/antenna
Length of output: 1823

🏁 Script executed:

```bash
#!/bin/bash
# Check if there's any prefetch for best_prediction in the viewset
rg -n "best_prediction\|Prefetch.*prediction" ami/main/api/views.py -B 3 -A 3
```

Repository: RolnickLab/antenna
Length of output: 44

🏁 Script executed:

```bash
#!/bin/bash
# Check the full context of the prefetch in views.py around line 1228
sed -n '1210,1250p' ami/main/api/views.py
```

Repository: RolnickLab/antenna
Length of output: 1576

🏁 Script executed:

```bash
#!/bin/bash
# Find OccurrenceListSerializer and check if it uses get_determination_details
rg -n "class OccurrenceListSerializer" ami/main/api/serializers.py -A 30
```

Repository: RolnickLab/antenna
Length of output: 1498

🏁 Script executed:

```bash
#!/bin/bash
# Check if determination_details field is in the serializer
rg -n "determination_details\|get_determination_details" ami/main/api/serializers.py -B 2 -A 2
```

Repository: RolnickLab/antenna
Length of output: 44

🏁 Script executed:

```bash
#!/bin/bash
# Find the method that serves determination_details
rg -n "def.*determination_details" ami/main/api/serializers.py -A 5
```

Repository: RolnickLab/antenna
Length of output: 407

🏁 Script executed:

```bash
#!/bin/bash
# Since the field is called determination_details, look for get_determination_details method
rg -n "get_determination_details" ami/main/api/serializers.py -B 5 -A 20
```

Repository: RolnickLab/antenna
Length of output: 1334

🏁 Script executed:

```bash
#!/bin/bash
# Check what OccurrenceSerializer is used for list vs detail views
rg -n "get_serializer_class\|OccurrenceListSerializer\|OccurrenceSerializer" ami/main/api/views.py -B 2 -A 3 | head -40
```

Repository: RolnickLab/antenna
Length of output: 44

🏁 Script executed:

```bash
#!/bin/bash
# Verify the action check for prefetch application
sed -n '1225,1245p' ami/main/api/views.py
```

Repository: RolnickLab/antenna
Length of output: 762

🏁 Script executed:

```bash
#!/bin/bash
# Check if best_identification also has the same issue
rg -n "best_identification" ami/main/api/serializers.py -B 2 -A 2
```

Repository: RolnickLab/antenna
Length of output: 397

🏁 Script executed:

```bash
#!/bin/bash
# Verify which serializer is used for list action
rg -n "def get_serializer_class" ami/main/api/views.py -A 10 | grep -A 10 "Occurrence"
```

Repository: RolnickLab/antenna
Length of output: 577

🏁 Script executed:

```bash
#!/bin/bash
# Verify the prefetch condition one more time - check if list action gets prefetch
sed -n '1220,1250p' ami/main/api/views.py
```

Repository: RolnickLab/antenna
Length of output: 1226

🏁 Script executed:

```bash
#!/bin/bash
# Double-check the list serializer uses the problematic method
sed -n '1282,1330p' ami/main/api/serializers.py | grep -A 50 "class OccurrenceListSerializer"
```

Repository: RolnickLab/antenna
Length of output: 2011

🏁 Script executed:

```bash
#!/bin/bash
# Let's verify the predictions() query more carefully
sed -n '2941,2960p' ami/main/models.py
```

Repository: RolnickLab/antenna
Length of output: 882

🏁 Script executed:

```bash
#!/bin/bash
# Check what with_identifications does - might it prefetch best_identification?
rg -n "def with_identifications" ami/main/models.py -A 15
```

Repository: RolnickLab/antenna
Length of output: 731
best_prediction access in list view causes N+1 queries for masked classifications
The N+1 issue is valid. The list view queryset explicitly excludes the detections prefetch (`if self.action != "list"`), so `OccurrenceListSerializer.get_determination_details()` accesses `obj.best_prediction` without any eager loading. Since `best_prediction` is a `@functools.cached_property` that executes `self.predictions().order_by("-terminal", "-score").first()`, it triggers a fresh query. When the returned classification is serialized with `ClassificationNestedSerializer` (which includes the `applied_to` field), accessing `applied_to` and `applied_to.algorithm` on masked classifications causes two additional lazy-load queries per occurrence.

For a list of 10 occurrences where top classifications are masked results, this adds 30+ extra queries beyond the base query cost.

The long-term fix is to annotate and prefetch `best_prediction` at the queryset level (as the existing TODO suggests). Short-term: ensure the queryset used in `predictions()` eagerly loads `applied_to__algorithm` for classifications where it applies, or refactor `get_determination_details()` to selectively avoid serializing `applied_to` for list views.
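A hedged sketch of the short-term option, mirroring the detail-view prefetch shown earlier in this review. The queryset attribute names here are assumptions, and note this only helps if `best_prediction` is refactored to read from the prefetched cache instead of issuing a fresh query — which is exactly what the existing TODO calls for:

```python
from django.db.models import Prefetch

# Applied to the list-action queryset as well, not just the detail view,
# so the top prediction's applied_to and its algorithm are loaded eagerly.
queryset = queryset.prefetch_related(
    Prefetch(
        "detections__classifications",
        queryset=Classification.objects.select_related(
            "taxon", "algorithm", "applied_to__algorithm"
        ),
    )
)
```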
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ami/main/api/views.py` around lines 1228 - 1238,
OccurrenceListSerializer.get_determination_details() is triggering N+1 queries
by accessing Detection.best_prediction (a cached_property that calls
Detection.predictions().order_by(...).first()) without eager-loading applied_to
and applied_to.algorithm; fix by ensuring the queryset used when building
detections for list view (or inside Detection.predictions()) prefetches
applied_to__algorithm for Classification objects, or alternatively adjust
OccurrenceListSerializer to avoid serializing applied_to in list action (e.g.,
use a lightweight nested serializer for list views); target symbols:
OccurrenceListSerializer.get_determination_details, Detection.best_prediction,
Detection.predictions, and ClassificationNestedSerializer (and the
applied_to__algorithm prefetch) so the top prediction’s applied_to and its
algorithm are loaded eagerly.
Summary
Implements class masking as a post-processing task: given a taxa list, re-weight classifier predictions so only species in the list contribute to the softmax, then update classifications and occurrence determinations accordingly.
Built on top of the generic post-processing framework (#954). Also includes a rank rollup post-processing task.
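To make the re-weighting concrete, here is a minimal sketch of the masking math — not the task's actual code; the function name and inputs are illustrative, with `logits` parallel to the category map's `labels`:

```python
import numpy as np

def mask_and_renormalize(logits: list[float], labels: list[str], allowed: set[str]) -> list[float]:
    """Send logits of out-of-list species to -inf, then recompute the softmax
    so only species on the taxa list share the probability mass."""
    masked = np.array(logits, dtype=float)
    for i, label in enumerate(labels):
        if label not in allowed:
            masked[i] = -np.inf  # exp(-inf) == 0: excluded species get zero score
    shifted = masked - masked[np.isfinite(masked)].max()  # numerically stable softmax
    exp = np.exp(shifted)
    return (exp / exp.sum()).tolist()

# Example: the third label is not on the taxa list, so the first two
# scores are renormalized to sum to 1.
scores = mask_and_renormalize(
    logits=[2.0, 1.0, 0.5],
    labels=["species_a", "species_b", "species_not_on_list"],
    allowed={"species_a", "species_b"},
)
```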
What's been added
- `ClassMaskingTask` — masks out classifier logits for species not in a given taxa list, recalculates softmax, creates new terminal classifications with updated scores
- `RankRollupTask` — rolls up low-confidence species classifications to genus/family level when aggregate scores exceed configurable thresholds (see the sketch after this list)
- `OccurrenceAdmin` action to trigger class masking with selectable TaxaList and Algorithm
- Management command: `run_class_masking --collection_id X --taxa_list_id Y --algorithm_id Z`
- `applied_to` field exposed as a nested object in all Classification API serializers
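A hedged sketch of the rollup decision for `RankRollupTask` — the real task resolves labels to `Taxon` rows and walks parent chains in the database; here parents are a plain dict, and all names besides the thresholds (which appear in the review diff above) are illustrative:

```python
DEFAULT_THRESHOLDS = {"SPECIES": 0.8, "GENUS": 0.6, "FAMILY": 0.4}
ROLLUP_ORDER = ["SPECIES", "GENUS", "FAMILY"]

def choose_rollup(scores: dict[str, float], parent: dict[str, str],
                  thresholds=DEFAULT_THRESHOLDS) -> tuple[str, str, float] | None:
    """Walk SPECIES -> GENUS -> FAMILY, aggregating child scores upward, and
    return the first (rank, taxon, score) whose aggregate clears its threshold."""
    for i, rank in enumerate(ROLLUP_ORDER):
        best_taxon, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score >= thresholds[rank]:
            return rank, best_taxon, best_score
        if i + 1 < len(ROLLUP_ORDER):
            # Aggregate to the next rank: sum each taxon's score into its parent
            rolled: dict[str, float] = {}
            for taxon, score in scores.items():
                rolled[parent[taxon]] = rolled.get(parent[taxon], 0.0) + score
            scores = rolled
    return None  # nothing cleared a threshold; leave the classification as-is

# Two sibling species at 0.45 each miss the 0.8 SPECIES threshold,
# but their genus aggregates to ~0.9 and clears the 0.6 GENUS threshold.
parents = {"species_a": "genus_x", "species_b": "genus_x", "genus_x": "family_y"}
print(choose_rollup({"species_a": 0.45, "species_b": 0.45}, parents))
# -> ('GENUS', 'genus_x', ~0.9)
```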
Related

How to test

Prerequisites

- `docker compose up -d`
- Admin account (antenna@insectai.org / localadmin)
- A dataset with `logits` stored (e.g., VCE-Mothra dataset or demo project)

Test class masking via admin UI
Verify results in the API (for UI development)

After running class masking, the API reflects the changes. Key endpoints:

Occurrence detail (includes nested classifications via detections):

Look at `detections[].classifications[]` — you should see:

- `terminal: false` (demoted)
- `terminal: true` and `applied_to` set (the masked result)

Classification list for a detection:
`applied_to` schema

When a classification was produced by post-processing (class masking, rank rollup), `applied_to` links back to the original classification. It's a nested object:

```json
{
  "id": 12345,
  "taxon": 42,
  "score": 0.87,
  "terminal": true,
  "algorithm": {
    "id": 15,
    "name": "Quebec & Vermont Classifier (filtered by taxa list Newfoundland Species)"
  },
  "applied_to": {
    "id": 12300,
    "created_at": "2026-01-15T10:30:00Z",
    "algorithm": {
      "id": 10,
      "name": "Quebec & Vermont Species Classifier - Apr 2024"
    }
  },
  "created_at": "2026-02-18T..."
}
```

For original (non-post-processed) classifications, `applied_to` is `null`.

Fields in `applied_to`:

- `id`
- `created_at`
- `algorithm` (nested object: `{id, name, ...}`)

UI integration notes for the identification tab
The occurrence identification tab (`MachinePrediction` component) currently shows all classifications. With post-processing:

- `terminal: true` classifications are the "current" predictions — display prominently
- `terminal: false` classifications are intermediate/superseded — show dimmed or collapsed
- `applied_to` (when non-null) means this is a derived prediction. The UI could show something like "Derived from Quebec & Vermont Classifier prediction" using `applied_to.algorithm.name` (a minimal sketch of this logic follows below)
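A minimal Python sketch of the lineage logic the UI would mirror, written against the example `applied_to` payload above (the function names are illustrative, not existing frontend code):

```python
def prediction_caption(classification: dict) -> str:
    """Caption for one classification from the API payload."""
    applied_to = classification.get("applied_to")
    if applied_to:
        # Derived prediction: point back at the original algorithm
        return f"Derived from {applied_to['algorithm']['name']} prediction"
    return classification["algorithm"]["name"]

def split_by_terminal(classifications: list[dict]) -> tuple[list[dict], list[dict]]:
    """Current (terminal) predictions vs. superseded intermediate ones."""
    current = [c for c in classifications if c.get("terminal")]
    superseded = [c for c in classifications if not c.get("terminal")]
    return current, superseded
```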
Test rank rollup via admin UI

Run tests

```bash
docker compose run --rm django python manage.py test ami.ml.tests.TestPostProcessingTasks --keepdb
```

All 6 tests should pass.
Remaining work

Triggering & usability

- `feat/logistic-binning` (antenna) and `feat/class-masking` (ami-data-companion) take different approaches. Reconcile.

Score re-weighting bug

Review & validation

- `applied_to` field exposed in API — nested object with `{id, created_at, algorithm: {id, name, ...}}`

UI/UX questions

- `applied_to` is now in the API. UI needs to render lineage.

Code quality

- `RankRollupTask` in registry
- `AlgorithmCategoryMap` for masked classifications

Deployment Notes

- `Classification.logits` — datasets processed before logit storage was added cannot be class-masked

Checklist
Screenshots
Admin: Class Masking Intermediate Form
UI: Occurrence Identifications Tab After Class Masking
Admin: New Classifications Created
Summary by CodeRabbit