Refactor standard deviation calculation #147
Open
wagnerlmichael wants to merge 6 commits into main
jeancochrane (Member) approved these changes on Feb 24, 2025 and left a comment:
I like this refactor! Can you share the query you ran to check that these results are identical to the existing results? Once I've verified that, we'll be good to go.
Comment on lines +229 to +244
# Vectorized per-row lower and upper thresholds (mean ± std * multiplier)
for col in [f"sv_price_deviation_{group_str}", f"sv_cgdr_deviation_{group_str}"]:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[
        1
    ] * df.groupby(list(groups))[col].transform("std")
if not condos:
    df["sv_which_price"] = df.apply(which_price, args=(holds, groups), axis=1)

    col = f"sv_price_per_sqft_deviation_{group_str}"
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    df[f"{col}_upper"] = df.groupby(list(groups))[col].transform("mean") + permut[
        1
    ] * df.groupby(list(groups))[col].transform("std")
[Suggestion, non-blocking] Nice generalization! We could go one step further and fold the if not condos branch into the for loop that precedes it:
thresh_cols = [
    f"sv_price_deviation_{group_str}",
    f"sv_cgdr_deviation_{group_str}",
] + ([] if condos else [f"sv_price_per_sqft_deviation_{group_str}"])
for col in thresh_cols:
    df[f"{col}_lower"] = df.groupby(list(groups))[col].transform("mean") - permut[
        0
    ] * df.groupby(list(groups))[col].transform("std")
    ...

sqft_val = row[f"sv_price_per_sqft_deviation_{group_str}"]
sqft_lower = row[f"sv_price_per_sqft_deviation_{group_str}_lower"]
sqft_upper = row[f"sv_price_per_sqft_deviation_{group_str}_upper"]
sqft_out = sqft_val > sqft_upper or sqft_val < sqft_lower
[Nitpick, non-blocking] Any reason not to use between_two_numbers() here, the way we do for the rest of the threshold checks?
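One thing that may be worth checking when swapping an explicit comparison for a range helper is how missing values are treated. The signature and NaN behavior of between_two_numbers() are not shown in this thread, so the sketch below uses pandas' Series.between as a stand-in rather than the repo's helper. The column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy values; the column names are illustrative, not the repo's.
df = pd.DataFrame(
    {
        "val": [5.0, 0.5, 9.0, np.nan],
        "lower": [1.0, 1.0, 1.0, 1.0],
        "upper": [8.0, 8.0, 8.0, 8.0],
    }
)

# Explicit comparison, as in the diff: NaN compares False on both sides,
# so a missing value is NOT marked as an outlier.
explicit_out = (df["val"] > df["upper"]) | (df["val"] < df["lower"])

# Negated range check: between() is inclusive by default and returns False
# for NaN, so negating it marks a missing value AS an outlier.
negated_out = ~df["val"].between(df["lower"], df["upper"])

print(explicit_out.tolist())  # [False, True, True, False]
print(negated_out.tolist())   # [False, True, True, True]
```

The two forms agree on every non-missing value and differ only on NaN, so whichever form is chosen, it seems worth deciding explicitly how rows with a missing deviation should be flagged.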
Member, Author
Reminder: There were inconsistencies between the flag outputs in main and this branch, so if we proceed with this work we'll need to figure that out.
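One way to narrow down such inconsistencies is to join the two outputs on a sale identifier and keep only the rows that disagree. The following is a minimal sketch, not the query used for this PR: diff_flags, sale_id, and flag are hypothetical names, and it assumes both outputs have been loaded into pandas DataFrames with one row per sale.

```python
import pandas as pd


def diff_flags(main_df, branch_df, key, flag_cols):
    """Return rows whose flag values differ, or that exist on one side only."""
    merged = main_df.merge(
        branch_df,
        on=key,
        how="outer",
        suffixes=("_main", "_branch"),
        indicator=True,
    )
    # Rows present in only one of the two outputs always count as mismatches.
    mismatch = merged["_merge"] != "both"
    for col in flag_cols:
        left, right = merged[f"{col}_main"], merged[f"{col}_branch"]
        # Treat two missing values as equal; any other difference is a mismatch.
        same = (left == right) | (left.isna() & right.isna())
        mismatch |= ~same
    return merged[mismatch]


# Toy outputs; `sale_id` and `flag` are hypothetical column names.
main_out = pd.DataFrame({"sale_id": [1, 2, 3], "flag": [True, False, False]})
branch_out = pd.DataFrame({"sale_id": [1, 2, 3], "flag": [True, True, False]})

diffs = diff_flags(main_out, branch_out, "sale_id", ["flag"])
print(diffs)
```

An empty result would indicate that the two branches flag the same sales identically for the listed columns. A non-empty result gives the specific sales to inspect.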
This PR showcases a potential refactor that reduces lines of code and may improve readability. Potential costs of this refactor are reduced modularity and higher computational cost (runtime). I've confirmed that this branch flags sales in exactly the same way as main by comparing the outputs of the two.
Previously, four functions were primarily responsible for the standard deviation calculations:
- pricing_info
- price_column
- which_price
- get_thresh

The standard deviation information had been held in a nested dictionary structure operated on with the get_thresh helper function. In this PR we switch to vectorized operations and remove the nested dictionary strategy, which included get_thresh.

Edit: After testing on the same subset of data, it seems like the proposed change decreases runtime for the overarching pricing_info function.
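For readers unfamiliar with the vectorized approach, here is a minimal, self-contained sketch of the idea on toy data. It is not the code in this PR: add_thresholds and the column and group names are hypothetical, and permut is assumed to be a (lower multiplier, upper multiplier) pair. Unlike the diff, the sketch computes each group statistic once per column and reuses it for both bounds.

```python
import pandas as pd


def add_thresholds(df, groups, cols, permut):
    """Add `<col>_lower` and `<col>_upper` columns holding per-group bounds.

    `permut` is assumed to be a (lower_multiplier, upper_multiplier) pair.
    """
    out = df.copy()
    grouped = df.groupby(list(groups))
    for col in cols:
        # Compute each group statistic once and broadcast it back to the rows.
        mean = grouped[col].transform("mean")
        std = grouped[col].transform("std")
        out[f"{col}_lower"] = mean - permut[0] * std
        out[f"{col}_upper"] = mean + permut[1] * std
    return out


# Toy data; the column and group names are illustrative only.
sales = pd.DataFrame(
    {
        "township": ["a", "a", "a", "b", "b"],
        "deviation": [1.0, 2.0, 3.0, 10.0, 14.0],
    }
)
result = add_thresholds(sales, ("township",), ["deviation"], (2, 2))
print(result)
```

Because transform returns a Series aligned to the original rows, every sale carries its own group's bounds, which is what lets the thresholds live in ordinary columns rather than in a nested dictionary lookup.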