Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##              main     #174      +/-   ##
==========================================
+ Coverage    69.39%   72.19%   +2.80%
==========================================
  Files           19       21       +2
  Lines         1186     1356     +170
==========================================
+ Hits           823      979     +156
- Misses          363      377      +14

View full report in Codecov by Sentry.
pre-commit.ci autofix
for more information, see https://pre-commit.ci

Conflicts: cinnabar/tests/test_compare.py
def compare_and_rank_femaps(
    femaps: list[FEMap],
Review comment: I wonder if it would be easier to have a single FEMap object that contains the calculated data from multiple experiments, using the "source" argument to distinguish between them?
Reply: Yes, that would be easier, as the current method requires you to add the same experimental info to all of the maps being compared. However, we would need to rework the MLE calculation on the legacy graph to group by source first and then split it into different edges, so we would essentially be doing this under the hood anyway. For now it might be easier to ask users to do this themselves? We could also add something which copies the experimental values to all graphs, if that would be easier?
Reply: I believe that was the intent of the FEMap API, i.e. to have multiple sources so it's easier to do one big comparison rather than handle multiple objects.
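For reference, a rough sketch of the list-of-FEMaps workflow discussed in this thread: one FEMap per calculation source, each carrying a copy of the same experimental values. The Measurement, ReferenceState and add_measurement usage, the import paths, and the unit handling are assumptions based on the published cinnabar API rather than on this PR, and the values are made up.

```python
# Assumed cinnabar API; not the code added in this PR.
from cinnabar import FEMap, Measurement, ReferenceState
from openff.units import unit

kcal = unit.kilocalorie_per_mole
experimental = {"ligA": (-9.1, 0.2), "ligB": (-10.3, 0.2)}   # toy absolute values
calculated = {
    "method_1": [("ligA", "ligB", -1.0, 0.1)],               # (A, B, ddG, error)
    "method_2": [("ligA", "ligB", -1.6, 0.2)],
}

femaps = []
for source, edges in calculated.items():
    femap = FEMap()
    for lig_a, lig_b, ddg, err in edges:
        femap.add_measurement(Measurement(
            labelA=lig_a, labelB=lig_b,
            DG=ddg * kcal, uncertainty=err * kcal,
            computational=True, source=source,
        ))
    # under the current design the same experimental reference values
    # have to be copied into every map being compared
    for lig, (dg, err) in experimental.items():
        femap.add_measurement(Measurement(
            labelA=ReferenceState(), labelB=lig,
            DG=dg * kcal, uncertainty=err * kcal,
            computational=False, source="experiment",
        ))
    femaps.append(femap)

# the maps would then be handed to the new function from this PR, e.g.
# (exact keyword arguments and return values assumed from the description):
# stats_df, comparison_df = compare_and_rank_femaps(femaps=femaps)
```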
Description
This PR adds a general comparison and ranking function for a collection of FEMaps that contain results for the same edges, to test whether any of them differ significantly in performance. The FEMaps are compared on a user-chosen metric (the mean unsigned error by default) via the distribution of differences in that metric under a joint bootstrapping procedure. A two-sided p-value is then taken from the fraction of negative or positive differences in the metric (whichever is smaller), a multiple-test correction scheme is applied to these p-values, and the corrected p-values are used to rank the FEMaps with a compact letter display (CLD) assigned via the insert-absorb method.
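To make the procedure concrete, here is a minimal, self-contained sketch of a joint-bootstrap comparison of the mean unsigned error between result sets, with an empirical two-sided p-value and a multiple-test correction. It uses only numpy and statsmodels and is not the code added by this PR; the resampling details, the doubling of the smaller tail fraction, and the choice of the Benjamini-Hochberg correction are illustrative assumptions.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2024)

def mue(calc, exp):
    """Mean unsigned error between calculated and experimental values."""
    return np.mean(np.abs(np.asarray(calc) - np.asarray(exp)))

def bootstrap_metric_difference(calc_a, calc_b, exp, n_boot=5000):
    """Distribution of MUE(A) - MUE(B) under joint (paired) resampling of edges."""
    calc_a, calc_b, exp = map(np.asarray, (calc_a, calc_b, exp))
    n = len(exp)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # the same resampled edges are used for both maps
        diffs[i] = mue(calc_a[idx], exp[idx]) - mue(calc_b[idx], exp[idx])
    return diffs

def two_sided_p(diffs):
    """Smaller tail fraction of the difference distribution, doubled for two sides."""
    smaller_tail = min(np.mean(diffs < 0), np.mean(diffs > 0))
    return min(1.0, 2.0 * smaller_tail)

# toy data: three methods scored against the same experimental values
exp = rng.normal(0.0, 1.0, size=30)
methods = {name: exp + rng.normal(0.0, spread, size=30)
           for name, spread in [("A", 0.4), ("B", 0.6), ("C", 1.0)]}

names = list(methods)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
raw_p = [two_sided_p(bootstrap_metric_difference(methods[a], methods[b], exp))
         for a, b in pairs]

# correct for multiple comparisons (the correction method here is an assumption)
reject, corrected_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
for (a, b), p, sig in zip(pairs, corrected_p, reject):
    print(f"{a} vs {b}: corrected p = {p:.3f}, significantly different = {sig}")
```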
The result is two pandas DataFrames: the first contains all evaluated metrics for each FEMap with confidence intervals and the CLD rank; the second contains the raw pairwise comparison data for the metric, including the p-values.
Example stats dataframe
Example comparison
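The compact letter display mentioned above can be illustrated with a generic re-implementation of the insert-absorb assignment (Piepho, 2004); this sketch is independent of cinnabar and only assumes that the pairwise comparisons judged significant after correction are available as name pairs.

```python
# Illustrative insert-absorb CLD assignment; not the code added in this PR.
from string import ascii_lowercase

def insert_absorb_cld(treatments, significant_pairs):
    """Assign letters so that two treatments share a letter
    iff they are NOT significantly different.

    treatments: ordered list of names (e.g. FEMap labels)
    significant_pairs: iterable of (name_a, name_b) pairs judged different
    """
    columns = [set(treatments)]            # start with one letter covering everything
    for a, b in significant_pairs:
        for col in [c for c in columns if a in c and b in c]:
            # "insert": split the offending column into two copies,
            # one without a and one without b
            columns.remove(col)
            columns.extend([col - {b}, col - {a}])
            # "absorb": drop columns that are proper subsets of another column
            columns = [c for c in columns
                       if not any(c < other for other in columns)]
    letters = {t: "" for t in treatments}
    for letter, col in zip(ascii_lowercase, columns):
        for t in treatments:
            if t in col:
                letters[t] += letter
    return letters

# toy example: A and C differ significantly, all other pairs are indistinguishable
print(insert_absorb_cld(["A", "B", "C"], [("A", "C")]))
# -> {'A': 'a', 'B': 'ab', 'C': 'b'}
```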
Todos
Notable points that this PR has either accomplished or will accomplish.
Questions
Checklist
news entry for new features, bug fixes, or other user facing changes.

Status
Tips
Since this will create a commit, it is best to make this comment when you are finished with your work.