Conversation
…model_evaluator to optionally include aggregate results and optionally use recall with fd
…rics to handle aggregation of good and bad matches. The logic is changed from reliance on _handle_hierarchical_field to compare_recursive in comparison_engine.
…turning the field_details. This will allow the aggregate metrics to correctly trickle up.
1) In comparison_dispatcher.py, inside the dispatch_field_comparison function, use handle_list_field_dispatch only for simple lists; when the field is a structured list, let compare_struct_list_with_scores handle it. 3) In structured_list_comparator.py, simplified _calculate_nested_field_metrics: no threshold is applied to matched pairs, and field details are calculated for all matched pairs and matched objects. Note that the following update no longer applies, as the method was removed in dev: 2) In structured_list_comparator.py, _handle_struct_list_empty_cases only handles the case where both lists are empty; otherwise it runs through the normal logic. This is necessary to get the field-level metrics.
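The dispatch rule described above could be sketched as follows. The function names (dispatch_field_comparison, handle_list_field_dispatch, compare_struct_list_with_scores) come from the PR text, but the FieldDef class and all signatures here are illustrative stand-ins, not the repo's actual API.

```python
from dataclasses import dataclass

@dataclass
class FieldDef:
    """Illustrative stand-in for a static field definition."""
    is_list: bool = False
    is_structured_list: bool = False

def compare_struct_list_with_scores(gt, pred, field_def):
    # Placeholder: the real comparator also handles null/empty values itself.
    return "structured-list"

def handle_list_field_dispatch(gt, pred, field_def):
    # Placeholder for the simple (primitive) list path.
    return "simple-list"

def compare_scalar(gt, pred, field_def):
    return "scalar"

def dispatch_field_comparison(gt, pred, field_def):
    # Structured lists take precedence and are fully delegated, even when the
    # value is None, so that field-level metrics can still be produced.
    if field_def.is_structured_list:
        return compare_struct_list_with_scores(gt, pred, field_def)
    if field_def.is_list:
        return handle_list_field_dispatch(gt, pred, field_def)
    return compare_scalar(gt, pred, field_def)
```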
… (e.g., when a field value is a list versus a string) 2) Removed the check for the _aggregate flag in test_json_schema_field_converter.py 3) One major change in many test files:
* Update incorrect test cases so that all test cases pass in test_structured_model_evaluator_nested.py. Some test cases still fail in test_ecommerce_orders_aggregate_comprehensive.py and require additional code changes. Clean up some print statements and unused tests. * Update comments per feedback --------- Co-authored-by: Hayley Park <parhyunj@amazon.com>
* Updating the dispatcher logic: 1) Move from runtime type checking to static field definition checking, by moving away from _should_use_hierarchical_structure and using _is_structured_list_field instead. This is done mainly for list fields, both primitive and structured; future changes could expand to other types as needed. 2) When the field is a structured list, delegate handling of all values, whether null or not, to StructuredListComparator. Updated structured_model: added a helper method to check whether a field is specifically List[StructuredModel], complementing _is_list_field(), which checks for any list type.
* Update the early-exit logic in Hungarian matching to make sure there are always matched pairs for aggregate metrics --------- Co-authored-by: Hayley Park <parhyunj@amazon.com>
* Previously, when a structured list was empty, it was counted as one TN for the aggregate at the object level, but aggregation could miss this (because overall metrics are skipped when there are child instances). This change creates a field dict that captures the TN at the child leaf node, so that the aggregate metric can trickle up correctly. 2) Updated tests to reflect the changes from the above logic
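The empty-list case described above might produce a result shaped roughly like this. The dictionary keys are illustrative; the real result structure in structured_list_comparator.py may differ.

```python
def empty_struct_list_result():
    """Sketch: both GT and prediction have an empty structured list, so a
    synthetic leaf-level entry records the single TN for aggregation."""
    return {
        # The TN lives at the leaf so that aggregate metrics, which walk child
        # entries and skip the parent's overall metrics, still pick it up.
        "overall": {"tp": 0, "fp": 0, "tn": 1, "fn": 0, "fa": 0, "fd": 0},
        "fields": {},
    }
```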
* Update Hungarian matching to better handle single-item lists: always match single-item lists, and count the pair as a TP if the item similarity score is higher than match_threshold. Update the FN count for a test_hungarian test case * Remove an unused, commented-out variable
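The single-item rule above could be sketched as follows: one-element lists are always paired, and the pair counts as a TP only when the similarity clears match_threshold. The similarity function and return shape are placeholders, not the library's API.

```python
def match_single_item_lists(gt_list, pred_list, similarity, match_threshold=0.8):
    """Sketch of the single-item special case in Hungarian matching."""
    if len(gt_list) != 1 or len(pred_list) != 1:
        raise ValueError("this sketch only covers the single-item case")
    score = similarity(gt_list[0], pred_list[0])
    matched_pairs = [(0, 0, score)]  # always emit the matched pair
    # Only a score above match_threshold counts as a true positive.
    tp = 1 if score > match_threshold else 0
    return matched_pairs, tp
```

Always emitting the pair (even below threshold) is what guarantees aggregate metrics have matched pairs to work from.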
…per.is_effectively_null_for_primitives in comparison_dispatcher
🔒 Security Scan Results

ASH Security Scan Report

Scan Metadata

Summary

Scanner Results: The table below shows findings by scanner, with status based on severity thresholds and dependencies.

Top 2 Hotspots: Files with the highest number of security findings.

Detailed Findings (2 actionable findings):
- Finding 1: CKV2_GHA_1
- Finding 2: javascript.browser.security.insecure-document-method.insecure-document-method

Report generated by Automated Security Helper (ASH) at 2026-01-29T23:45:46+00:00
```diff
- # items.name should have 1 TP (only item1's name, item2 was below threshold)
+ # items.name should have 2 TP (both item1 and item2)
+ # Overall should have some metrics from poor matches at the leaf node level.
  name_metrics = items_fields["name"]
  if "overall" in name_metrics:
-     assert name_metrics["overall"]["tp"] == 1
+     assert name_metrics["overall"]["tp"] == 2
  else:
```
I agree with this, looks correct as written here.
```diff
- # items.count should have 1 TP (only item1's count, item2 was below threshold)
+ # items.count should have 1 TP (only item1's count, item2's count did not pass comparison)
  count_metrics = items_fields["count"]
  if "overall" in count_metrics:
      assert count_metrics["overall"]["tp"] == 1  # item1 count matches
      assert (
-         count_metrics["overall"]["fp"] == 0
-     )  # No false positives since item2 not analyzed at field level
-     assert count_metrics["overall"]["fd"] == 0  # No false discoveries for count
+         count_metrics["overall"]["fp"] == 1
+     )  # 1 False positive since item2 matched but count should have been empty
+     assert count_metrics["overall"]["fa"] == 1  # 1 false alarm for count
  else:
```
Agree with this fix as well.
```diff
      "Items in EST but not in GT (null) should be False Alarm (FP)"
  )
- assert items_cm["tn"] == 1, "Both empty lists should be TN"
+ assert items_cm["tn"] == 1, f"Both empty lists should be TN: {items_cm}"
```
Just better assertion comment.
```diff
  name_metrics = items_fields["name"]
  if "overall" in name_metrics:
-     assert name_metrics["overall"]["tp"] == 1
+     assert name_metrics["overall"]["tp"] == 2
```
Agree, there are two 'name' instances.
```diff
- # items.description should have 1 TP (only item1's description, item2 was below threshold)
+ # items.description should have 1 TP (only item1's description, item2's description did not pass comparison)
  desc_metrics = items_fields["description"]
  if "overall" in desc_metrics:
      assert desc_metrics["overall"]["tp"] == 1  # item1 description matches
      assert (
-         desc_metrics["overall"]["fd"] == 0
-     )  # No false discoveries since item2 not analyzed at field level
-     assert desc_metrics["overall"]["fp"] == 0  # No false positives for description
+         desc_metrics["overall"]["fd"] == 1
+     )  # 1 false discovery since item2 matched but description not correct at field level
+     assert desc_metrics["overall"]["fp"] == 1  # 1 false positive for description
  else:
      assert desc_metrics["tp"] == 1  # item1 description matches
```
I agree with this assertion too.
```diff
  # Direct access to metrics for pet fields (not in "overall")
  # petId metrics
  assert (
-     get_metric(cm["fields"]["pets"]["fields"]["petId"], "tp") == 1
- ), "Expected 1 true positives"
+     get_metric(cm["fields"]["pets"]["fields"]["petId"], "tp") == 2
+ ), "Expected 2 true positives"
  assert (
```
```diff
-     get_metric(cm["fields"]["pets"], "tn") == 0
+     get_metric(cm["fields"]["pets"], "tn") == 0,
  ), "Expected 0 true negatives for pets field overall performance"
```
I think we need to remove that extra comma and that should fix it
```diff
-     get_metric(cm["fields"]["pets"]["fields"]["name"], "tp") == 1
- ), "Expected 1 true positives"
+     get_metric(cm["fields"]["pets"]["fields"]["name"], "tp") == 2
+ ), "Expected 2 true positives"
```
Again, same comment as before: https://github.com/awslabs/stickler/pull/72/changes#r2776109862
The cat "buttons" falls below the default match_threshold of the Pet object, therefore it's incorrect to compare the values of the name, even though they match.
```diff
-     get_metric(cm["fields"]["pets"]["fields"]["species"], "tp") == 0
- ), "Expected 0 true positives"
+     get_metric(cm["fields"]["pets"]["fields"]["species"], "tp") == 1
+ ), "Expected 1 true positives"
```
```diff
  expected_top_aggregate = {
      'tp': 12, 'fa': 1, 'fd': 4, 'fp': 5, 'tn': 0, 'fn': 1
  }
```
```diff
  # Exact match: PROD-001 vs PROD-001 should be TP
  assert field_metrics["overall"]["tp"] == 1, (
      f"Nested field {field} should have 1 TP"
+ # Even though poor match, at the field level this is an exact match: PROD-002
```
Actually, this is not right, per my comment below: https://github.com/awslabs/stickler/pull/72/changes#r2776109862
sromoam
left a comment
So we do need some changes to this PR.
Apologies @hayleypark that I didn't catch this earlier.
I recommend you strip the scope down to only the aggregate metrics and the tests here.
The changes to the non-aggregate tests are incorrect, after a review, because of the logic we have documented for how we evaluate list objects: https://github.com/awslabs/stickler/blob/d24894c86a2c52a62fa306748cbf2e7740faebe3/docs/docs/Core-Concepts/Threshold_Gated_Recursive_Evaluation.md
This is our current, documented method: we don't count any confusion-matrix entries for objects that aren't matched up to the object's match_threshold, the logic being that if you need this comparison, you can turn that value down. If you want to propose a change to this logic, that's fine, but we need to revise the documentation as well.
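The threshold-gated rule cited above could be sketched as follows: field-level counts are only accumulated for object pairs whose match score reaches the object's match_threshold, and pairs below it are skipped entirely. Names and the exact-equality field check are illustrative, not the library's implementation.

```python
def gated_field_tp_counts(matched_pairs, match_threshold):
    """matched_pairs: list of (gt_dict, pred_dict, object_score) tuples.
    Returns per-field TP counts, gated by the object-level threshold."""
    tp = {}
    for gt_obj, pred_obj, object_score in matched_pairs:
        if object_score < match_threshold:
            continue  # spurious match: do not compare its fields at all
        for field, gt_value in gt_obj.items():
            if pred_obj.get(field) == gt_value:
                tp[field] = tp.get(field, 0) + 1
    return tp
```

Under this rule, lowering match_threshold is the documented way to make below-threshold pairs contribute field-level counts.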
```diff
      field_metrics["overall"]["tp"] + field_metrics["overall"]["fd"]
-     == 1
- ), f"Nested field {field} should have 1 comparison"
+     == 3
+ ), f"Nested field {field} should have 3 comparisons"
  elif field == "price":
```
Actually, now that I'm understanding this better, this is not correct.
The reason is that the second and third products in the list do not match each other respectively, and are therefore considered false detections at the 'product' level. The way to make this assertion true in this test would be to set the match_threshold on the Product object lower, so that the comparisons would proceed. The current match threshold of 0.8 precludes the 2nd and 3rd list items, with the assertion that these are spuriously matched and therefore not valid to compare to each other.
Issue #, if available: #33 & Fix universal aggregate metrics calculation
Description of changes:
Updated Hungarian Matching for Structured Lists

Updated StructuredListComparator

Dispatcher Logic Refactoring:
- Moved from runtime type checking to static field definition checking (_is_structured_list_field)
- Delegated structured-list handling to StructuredListComparator

Improved Test Cases for Aggregate Functionality

Enhanced Bulk Evaluator with Aggregate Support:
- Added include_aggregates parameter to BulkStructuredModelEvaluator
- Added recall_with_fd parameter for flexible recall calculations

Assumption: For aggregate metrics, all matched items (regardless of the similarity score) are considered.
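A recall_with_fd toggle could be sketched as below. The thread does not show BulkStructuredModelEvaluator's actual formula; including false discoveries (fd) in the denominator is an assumption made purely for illustration.

```python
def recall(tp, fn, fd=0, recall_with_fd=False):
    """Sketch only: recall, optionally counting false discoveries (fd) in the
    denominator alongside false negatives. The real formula may differ."""
    denominator = tp + fn + (fd if recall_with_fd else 0)
    return tp / denominator if denominator else 0.0
```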
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.