More stable algorithm for variance, standard deviation #456
Conversation
flox/core.py (Outdated)
```diff
 for reduction, fv, kw, dt in zip(funcs, fill_values, kwargss, dtypes):
-    if empty:
+    # UGLY! but this is because `var` breaks our design assumptions
+    if empty and reduction is not var_chunk:
```
This code path is an "optimization" for chunks that don't contain any valid groups, so `group_idx` is all `-1`.

We will need to override `full` in MultiArray. Look up what the `like` kwarg does here; it dispatches to the appropriate array type.

The next issue will be that `fill_value` is a scalar like `np.nan`, but that doesn't work for all our intermediates (e.g. the "count").

- My first thought is that `MultiArray` will need to track a default fill_value per array. For `var`, this can be initialized to `(None, None, 0)`. If `None`, we use the `fill_value` passed in; else the default. (See the sketch after this comment.)
- The other way would be to hardcode some behaviour in `_initialize_aggregation` so that `agg.fill_value["intermediate"] = ((fill_value, fill_value, 0),)`, and then MultiArray can receive that tuple and do the "right thing".

The other place this will matter is in `reindex_numpy`, which is executed at the combine step. I suspect the second, tuple approach is the best.

This bit is hairy and ill-defined. Let me know if you want me to work through it.
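A minimal sketch of the first option, assuming a hypothetical `MultiArray` wrapper; the names, signature, and defaults here are illustrative, not the PR's actual implementation:

```python
import numpy as np

class MultiArray:
    """Hypothetical wrapper holding one ndarray per intermediate."""

    def __init__(self, arrays, default_fill_values=None):
        self.arrays = tuple(arrays)
        # One default per array; None means "use the fill_value passed in".
        self.default_fill_values = default_fill_values or (None,) * len(self.arrays)

    @classmethod
    def full(cls, shape, fill_value, *, like, dtype=None):
        # Mirrors np.full's `like` protocol: `like` is an existing MultiArray
        # whose per-array defaults decide which fill each intermediate gets.
        return cls(
            [
                np.full(shape, fill_value if fv is None else fv, dtype=dtype)
                for fv in like.default_fill_values
            ],
            default_fill_values=like.default_fill_values,
        )

# For `var`, the defaults would be (None, None, 0): the caller's fill_value
# (e.g. np.nan) for the two sum-like intermediates, zero for the count.
template = MultiArray([np.empty(0)] * 3, default_fill_values=(None, None, 0))
chunk = MultiArray.full((4,), np.nan, like=template, dtype=np.float64)
```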
I'm partway through implementing something that works here. Two questions:

- How do I trigger this code pathway without brute-force overwriting `if empty:` with `if True:`?
- When `np.full` is called, `like` is a NumPy array, not a MultiArray, because it's (I think) the chunk data and bypassing `var_chunk` (could also be an artefact of the `if True:` override above?). In a pinch, I guess I could add an `elif` that catches the `empty and reduction is var_chunk` case and coerce that into a MultiArray, but that's also ugly, so I'm hoping you might have better ideas.
Thinking some more, I may have misinterpreted what fill_value is used for. When is it needed for intermediates?
This is great progress! Now we reach some much harder parts. I pushed a commit to show where I think the "chunk" function should go and left a few comments. I think the next steps should be to […]
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
for more information, see https://pre-commit.ci
Just to make sure I understand: there are situations in which `var_combine` needs to handle combining intermediates along multiple axes at once? If so, that's tricky, because the equation I'm using merges MultiArray sets of intermediates pairwise. I think the best way to handle it is probably to stack those axes? (Or is that bad for memory?) Or possibly deal with the dimensions one after the other? Or we could do something awful with a 2D cumulative sum that loops around for calculating the adjustment terms, but that'd pretty much guarantee the code is unintelligible to anybody else.
> we could explicitly accumulate to float64 intermediates in
Yes, I think a reshape would be fine; the axes you are reshaping will be contiguous, so the result will be a view of the original with no extra memory. Just make sure to reshape back to the same dimensionality at the end.
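A hedged sketch of that pattern (illustrative only; `np.add.reduce` stands in for the actual pairwise var-combine, and the helper name is hypothetical):

```python
import numpy as np

def combine_along_axes(arr, axes, combine=np.add.reduce):
    """Collapse multiple contiguous axes into one, combine along that single
    axis, then restore the original dimensionality.

    `combine` is a placeholder reduction so the sketch runs; the real code
    would call the pairwise variance combine here.
    """
    axes = sorted(axes)  # assumed contiguous, e.g. [1, 2]
    shape = arr.shape
    merged = int(np.prod([shape[ax] for ax in axes]))
    # Because the merged axes are contiguous, this reshape is a view:
    # no copy, no extra memory.
    flat = arr.reshape(shape[: axes[0]] + (merged,) + shape[axes[-1] + 1 :])
    out = combine(flat, axis=axes[0])
    # Reshape back to the original dimensionality, with size-1 combined axes.
    return out.reshape(tuple(1 if ax in axes else n for ax, n in enumerate(shape)))

x = np.arange(24.0).reshape(2, 3, 4)
print(combine_along_axes(x, axes=[1, 2]).shape)  # (2, 1, 1)
```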
Jemma, please let me know how I can help here. I'm happy to tackle some of these failing tests if you prefer.
This reverts commit d77c132.
Thanks Deepak. I'm pretty busy for the next few weeks, which is just to say that if I drop off for a couple of days it's not because I've lost interest. Back to normal on 23rd September, but I'll try to get this across the line this week or early next.

I've got a sense of how to implement the reshape/multiple axes for `var_chunk`, and am hoping to get to it soon. I think this will fix most of the failing tests?

Regarding casting to float64, I'm not confident that I've thought through all the edge cases. For instance, we probably wouldn't want to cast `np.longdouble` back to `np.float64`? Or is it a safe assumption that `np.float64` is a good dtype for intermediates no matter what the inputs were? If you have a solution you're happy with, feel free to fix this one; otherwise I'll get to it when I can. (One possible promotion rule is sketched after this comment.)

It's probably pretty obvious, but I've got basically no familiarity with pytest, so I'm developing that familiarity while trying to use it here. I'm happy to keep working my way slowly through it, but I'm also happy for you to take on other failing tests.

What do you see as the outstanding tasks to get this PR finalised? Is it just to address the causes of all the failing tests? I think we might have a couple of unresolved comment threads from code reviews that I'll try to track down. Any suggestions for how you'd normally keep track of the last few things to do?
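On the float64 question: one possible rule, offered here as an assumption rather than a decision from this PR, is to promote with `np.result_type`, which widens float32 and integer inputs to float64 but leaves `np.longdouble` untouched:

```python
import numpy as np

def intermediate_dtype(input_dtype):
    # Promote to at least float64 for accumulation, but never narrow:
    # result_type(float32, float64)    -> float64
    # result_type(longdouble, float64) -> longdouble
    return np.result_type(input_dtype, np.float64)

print(intermediate_dtype(np.float32))     # float64
print(intermediate_dtype(np.longdouble))  # longdouble (platform-dependent width)
```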
for more information, see https://pre-commit.ci
pytest is an acquired taste hehe, and it's used in complex ways here. Some tricks that will help:
No one's really using that, or if they are, we can fix it when they complain.
Yes, happy to merge when tests pass. Looks like you're just down to the property tests failing?! If so, I can clear those up.
I think so? I wrapped the part of my code that combines along a single axis in a for loop, so it only has to handle one axis at a time. All the other ideas I tried were nasty to implement.
Please :)
Okay, fair.
I was deliberately trying to avoid triggering the "invalid value encountered in divide" warning from NumPy, though I concede it was a bit of a hack. Do we want `den == 0` to also yield NaN (as opposed to inf)?
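For reference, a minimal warning-free way to get NaN for `den == 0` (an illustrative sketch, not necessarily the code this PR ends up with):

```python
import numpy as np

num = np.array([4.0, 0.0, 9.0])
den = np.array([2.0, 0.0, 3.0])

# Divide only where den != 0; elsewhere the `out` initializer (NaN) is kept.
# The division never sees the zero denominators, so no "invalid value
# encountered in divide" warning is raised, and den == 0 gives NaN, not inf.
result = np.divide(num, den, out=np.full_like(num, np.nan), where=den != 0)
print(result)  # [ 2. nan  3.]
```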
Force-pushed from 9391952 to 8eaddc1 (Compare)
So happy this is in! 👏🏾 👏🏾 👏🏾
Updated the `nanvar` algorithm to use an adapted version of the approach from the Schubert and Gertz (2018) paper mentioned in #386, following discussion in #422.
Closes #386
Closes #422
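For context, here is a sketch of the pairwise combination underlying that scheme as I understand the paper; the PR's actual `var_combine` generalizes this to grouped, multi-axis intermediates. Each partition carries `(sum, m2, count)`, where `m2` is the sum of squared deviations from that partition's own mean:

```python
import numpy as np

def combine_var_intermediates(sum_a, m2_a, n_a, sum_b, m2_b, n_b):
    """Pairwise combine of variance intermediates (Schubert & Gertz, 2018)."""
    n = n_a + n_b
    # Correction term accounts for the difference between partition means.
    delta = sum_b / n_b - sum_a / n_a
    m2 = m2_a + m2_b + delta**2 * n_a * n_b / n
    return sum_a + sum_b, m2, n

# Usage: recover the full-array variance from two halves' intermediates.
x = np.random.default_rng(0).normal(size=10)
a, b = x[:5], x[5:]
s, m2, n = combine_var_intermediates(
    a.sum(), ((a - a.mean()) ** 2).sum(), a.size,
    b.sum(), ((b - b.mean()) ** 2).sum(), b.size,
)
print(np.isclose(m2 / n, x.var()))  # True
```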