First draft of Reporting Source of Truth™ #496
Draft: wrridgeway wants to merge 103 commits into master from 387-reporting-sot.

Changes from all 103 commits:
- d780d76 First draft of sales script
- 00909fd File renaming
- 2ac5982 Cleaner for loop
- 2107d2a First draft taxes and exemptions table
- c56aaaf Wrap assessment_roll
- 6c81308 Correct size, count calculations
- 1bf9b9c Wrap sales table
- 0a9e1f3 Correct stage grouping, counting
- 030a7c5 Fix assessment roll stage grouping
- 1c2adae Clean output before writing
- 672bd1e Begin dbt building
- 0c42e23 Merge branch 'master' into 387-reporting-sot
- 3f60a77 Attempt to build assessment_roll table
- fdff457 Testing build on smaller input
- 6abd074 Trying to build on limited sample
- fd342b6 Try to build sales table
- cccf8e1 Try to build taxes and exemptions table
- 3656964 Try to build taxes and exemptions table
- 8b0f95f Try to build taxes table
- 9383bdc Try to build ratio stats table
- 08d3bd6 Add assesspy to ratio_stats table
- d2cac22 ratio_stats builds in dbt, excluding assesspy funcs
- f559753 sot_ratio_stats table building in dbt
- 1f8ad1f Add res_other group
- 063591c Add reassessment year indicator for assessment roll
- a9ffc64 Retry assessment_year indicator
- 62dd68e Assessment_roll should run with reassessment year indicator
- c185e81 Add schema to assessment_roll table
- d08bc3d Correct output from sales and taxes tables
- 4808aa4 Add table schemas
- 08c8d53 Fix schemas
- 2f8dc3d Resolve sales table column type issues
- 88ce049 Add exe_total to exemptions table
- 271576d Add more ratio stats
- c39a2d8 Clean sales table columns
- 20c9bd6 Clean taxes table columns
- adc16ea Clean assessment_roll columns
- f8b87ab Fix delta columns
- 54ebab8 Clean ratio table columns
- d2dddab Attempt to fix pin_n_tot type error that doesn't trigger locally
- 00e790c Try again to fix pin_n_tot
- 408de56 Change ass roll sample to be able to compare across stages
- fd95fcb Add commenting for input tables, try to partion assessment_roll table
- f296292 Comment python scripts
- a23ff72 Clean up ratio_stats script
- 07f6dfe Back to fixing pin_n_tot
- b78a072 Replace nan with None
- 337954e Partition input tables by year
- 1031144 Fix year partitioning
- 45ea305 Use double for nullable columns
- ca139f3 Move data year specification to dbt seed
- 788f971 Formatting
- 4ea6718 Merge branch 'master' into 387-reporting-sot
- 5449d8c Improve diff and pct_change syntax
- c87713f Simplify reassessment year syntax
- d1079f0 More commenting
- b4316b2 Merge branch 'master' into 387-reporting-sot
- 28ba90c Lint
- cb50a51 Clean up
- 978ad93 Use new assesspy inputs
- a471a42 Update assesspy version
- d28f02c Add back documentation
- f8258cf Improve documentation
- e213a32 Add outlier sales filtering
- 1c8f1b3 Count outlier sales
- d26fed0 Exclude outliers from sales char stats
- 91c5040 Clarify bldg and land sf
- c3cc7ba Improve schema declaration
- c8230ea Update schema declarations
- 34010b8 Merge branch 'master' into 387-reporting-sot
- 023c341 Store testing
- 57e4cc3 Remove test script
- 68d8c2d Merge branch 'master' into 387-reporting-sot
- b0d9ad7 Test script back to working
- b7ee5bd Merge branch 'master' into 387-reporting-sot
- f056835 Temp changes
- 744ddd6 Everything but delta cols
- ae79bc3 Remove vestigial objects
- 080d5b7 Simplify schema creation
- de27eb8 Aggregate spark dfs
- 98f1bed Merge branch 'master' into 387-reporting-sot
- 214ec7a Remove temp limit on ass roll table
- 26f00e3 Try table build with spark
- 2509e40 Remove old table, rerun build to gen error log
- 23c6fb8 Debugging input pyspark dataframe
- 763915e Pass geography to aggregate
- 29c15a1 Reduce input size to test runner memory limits
- b99dd24 Really reduce input size
- 76a7df5 Further reduce input size
- 8fa0e4e Try a really small input
- 8f1ee19 Change int type for pyarrow
- b8bdf39 Try coercing expected string columns
- f33a2e4 Remove string coersion for output table
- 2264e1d Try to increase max driver result for spark session
- dfb7d1d Change spark driver config access
- 2323aea One more driver attempt
- 8cfc713 Try new engine config
- db21c18 Test smaller amount of collection
- e6681fe Remove config without permission
- 34fa863 Test using entire input
- 9b5d48f Revert for now
- e4aeb0e Remove limit again for testing
- e146066 Attempt to collect more often
Changed hunk in the project config (file name not captured here), adding custom schemas for seed groups (`+schema` is dbt config syntax, not a diff marker):

```
@@ -78,5 +78,7 @@ seeds:
      +schema: location
    model:
      +schema: model
    reporting:
      +schema: reporting
    spatial:
      +schema: spatial
```
New file (165 additions), a dbt Python model:

```python
# This script generates aggregated summary stats on assessed values across a
# number of geographies, class combinations, and time.

# Import libraries
import pandas as pd
from pyspark.sql.functions import lit


# Define aggregation functions. These are just wrappers for basic Python
# functions that make them easier to use with pandas.agg().
def q10(x):
    return x.quantile(0.1)


def q25(x):
    return x.quantile(0.25)


def q75(x):
    return x.quantile(0.75)


def q90(x):
    return x.quantile(0.9)


def first(x):
    if len(x) >= 1:
        output = x.iloc[0]
    else:
        output = None

    return output


def reassessment_year(year, geography, triad):
    if geography in ["triad", "township", "nbhd"]:
        year = int(year) % 3

        if (
            ((year == 0) & (triad == "North"))
            | ((year == 1) & (triad == "South"))
            | ((year == 2) & (triad == "City"))
        ):
            out = "Yes"
        else:
            out = "No"
    else:
        out = ""

    return out


def aggregate_geography(geography):
    def aggregate(key, pdf):
        columns = ["av_tot", "av_bldg", "av_land"]

        out = ()
        out += (
            reassessment_year(pdf["year"][0], geography, pdf["triad"][0]),
            first(pdf[years[geography]]),
            len(pdf["av_tot"]),
            pdf["av_tot"].count(),
            pdf["av_tot"].count() / pdf["av_tot"].size,
        )
        for column in columns:
            out += (
                pdf[column].min(),
                q10(pdf[column]),
                q25(pdf[column]),
                pdf[column].median(),
                q75(pdf[column]),
                q90(pdf[column]),
                pdf[column].max(),
                pdf[column].mean(),
                pdf[column].sum(),
            )

        return pd.DataFrame([key + out])

    return aggregate


groups = [
    "res_other",
    "major_class",
    "no_group",
    "class",
    "modeling_group",
]

years = {
    "county": "year",
    "triad": "year",
    "township": "year",
    "nbhd": "year",
    "tax_code": "year",
    "zip_code": "year",
    "community_area": "community_area_data_year",
    "census_place": "census_data_year",
    "census_tract": "census_data_year",
    "census_congressional_district": "census_data_year",
    "census_zcta": "census_data_year",
    "cook_board_of_review_district": "cook_board_of_review_district_data_year",
    "cook_commissioner_district": "cook_commissioner_district_data_year",
    "cook_judicial_district": "cook_judicial_district_data_year",
    "ward_num": "ward_data_year",
    "police_district": "police_district_data_year",
    "school_elementary_district": "school_data_year",
    "school_secondary_district": "school_data_year",
    "school_unified_district": "school_data_year",
    "tax_municipality": "tax_data_year",
    "tax_park_district": "tax_data_year",
    "tax_library_district": "tax_data_year",
    "tax_fire_protection_district": "tax_data_year",
    "tax_community_college_district": "tax_data_year",
    "tax_sanitation_district": "tax_data_year",
    "tax_special_service_area": "tax_data_year",
    "tax_tif_district": "tax_data_year",
    "central_business_district": "central_business_district_data_year",
}

geographies = list(years.keys())
geographies = [
    geographies[0]
]  # For testing purposes, only use the first geography

output_schema = "stage_name string, group_id string, geography_id string, year string, reassessment_year string, geography_data_year string, pin_n_tot bigint, pin_n_w_value bigint, pin_pct_w_value double, min_av_tot double, q10_av_tot double, q25_av_tot double, median_av_tot double, q75_av_tot double, q90_av_tot double, max_av_tot double, mean_av_tot double, sum_av_tot double, min_av_bldg double, q10_av_bldg double, q25_av_bldg double, median_av_bldg double, q75_av_bldg double, q90_av_bldg double, max_av_bldg double, mean_av_bldg double, sum_av_bldg double, min_av_land double, q10_av_land double, q25_av_land double, median_av_land double, q75_av_land double, q90_av_land double, max_av_land double, mean_av_land double, sum_av_land double"


def model(dbt, spark_session):
    dbt.config(
        materialized="table",
        engine_config={
            "MaxConcurrentDpus": 40,
        },
    )

    # athena_user_logger is provided by the Athena PySpark runtime
    athena_user_logger.info("Loading assessment roll input table")

    input = dbt.ref("reporting.sot_assessment_roll_input")

    athena_user_logger.info("Dope stuff is happening... maybe?")

    output = []
    for group in groups:
        for geography in geographies:
            output += [
                input.groupby(["stage_name", group, geography, "year"])
                .applyInPandas(
                    aggregate_geography(geography),
                    schema=output_schema,
                )
                .select(
                    "*",
                    lit(group).alias("group_type"),
                    lit(geography).alias("geography_type"),
                )
                .toPandas()
            ]

    df = pd.concat(output)

    return df
```
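For context beyond the diff: the small named helpers (q10, q25, and so on) follow a common pandas pattern, where a wrapper function's `__name__` becomes the aggregate's column label. A minimal pure-pandas sketch of that pattern, with hypothetical toy data and no Spark dependency:

```python
# Sketch only: toy data, not the real assessment tables.
import pandas as pd


def q10(x):
    return x.quantile(0.1)


def q90(x):
    return x.quantile(0.9)


df = pd.DataFrame(
    {
        "township": ["Evanston", "Evanston", "Lyons", "Lyons"],
        "av_tot": [10_000, 30_000, 20_000, 40_000],
    }
)

# Mixing string names and callables: each callable's __name__ becomes the
# column label, which is why the model wraps Series.quantile in named funcs.
stats = df.groupby("township")["av_tot"].agg(["min", q10, "median", q90, "max"])
print(stats)
```

The model applies the same idea per Spark group via `applyInPandas`, building one output row per (stage, group, geography, year) combination.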
New file (104 additions): dbt/models/reporting/reporting.sot_assessment_roll_input.sql
```sql
-- This script gathers parcel-level geographies and joins them to values and
-- class groupings. Its sole purpose is to feed reporting.sot_assessment_roll;
-- it should not be used otherwise.
{{
    config(
        materialized='table',
        partitioned_by=['year']
    )
}}

/* Ensure every municipality/class/year has a row for every stage through
cross-joining. This is to make sure that combinations that do not yet exist in
iasworld.asmt_all for the current year will exist in the view, but have
largely empty columns. For example: even if no class 4s in the City of Chicago
have been mailed yet for the current assessment year, we would still like an
empty City of Chicago/class 4 row to exist for the mailed stage. */
WITH stages AS (
    SELECT 'MAILED' AS stage_name
    UNION
    SELECT 'ASSESSOR CERTIFIED' AS stage_name
    UNION
    SELECT 'BOR CERTIFIED' AS stage_name
),

-- Universe of all parcels as defined by iasworld.pardat, expanded with
-- assessment stages.
uni AS (
    SELECT
        vw_pin_universe.*,
        stages.*
    FROM {{ ref('default.vw_pin_universe') }}
    CROSS JOIN stages
)

SELECT
    uni.stage_name,
    uni.class,
    CAST(vals.tot AS INT) AS av_tot,
    CAST(vals.bldg AS INT) AS av_bldg,
    CAST(vals.land AS INT) AS av_land,
    'Cook' AS county,
    uni.triad_name AS triad,
    uni.township_name AS township,
    uni.nbhd_code AS nbhd,
    uni.tax_code,
    uni.zip_code,
    uni.chicago_community_area_name AS community_area,
    uni.census_place_geoid AS census_place,
    uni.census_tract_geoid AS census_tract,
    uni.census_congressional_district_geoid AS census_congressional_district,
    uni.census_zcta_geoid AS census_zcta,
    uni.cook_board_of_review_district_num AS cook_board_of_review_district,
    uni.cook_commissioner_district_num AS cook_commissioner_district,
    uni.cook_judicial_district_num AS cook_judicial_district,
    uni.ward_num,
    uni.chicago_police_district_num AS police_district,
    uni.school_elementary_district_geoid AS school_elementary_district,
    uni.school_secondary_district_geoid AS school_secondary_district,
    uni.school_unified_district_geoid AS school_unified_district,
    ARRAY_JOIN(uni.tax_municipality_name, ', ') AS tax_municipality,
    ARRAY_JOIN(uni.tax_park_district_name, ', ') AS tax_park_district,
    ARRAY_JOIN(uni.tax_library_district_name, ', ') AS tax_library_district,
    ARRAY_JOIN(uni.tax_fire_protection_district_name, ', ')
        AS tax_fire_protection_district,
    ARRAY_JOIN(uni.tax_community_college_district_name, ', ')
        AS tax_community_college_district,
    ARRAY_JOIN(uni.tax_sanitation_district_name, ', ')
        AS tax_sanitation_district,
    ARRAY_JOIN(uni.tax_special_service_area_name, ', ')
        AS tax_special_service_area,
    ARRAY_JOIN(uni.tax_tif_district_name, ', ') AS tax_tif_district,
    uni.econ_central_business_district_num AS central_business_district,
    uni.census_data_year,
    uni.cook_board_of_review_district_data_year,
    uni.cook_commissioner_district_data_year,
    uni.cook_judicial_district_data_year,
    COALESCE(uni.ward_chicago_data_year, uni.ward_evanston_data_year)
        AS ward_data_year,
    uni.chicago_community_area_data_year AS community_area_data_year,
    uni.chicago_police_district_data_year AS police_district_data_year,
    uni.econ_central_business_district_data_year
        AS central_business_district_data_year,
    uni.school_data_year,
    uni.tax_data_year,
    'no_group' AS no_group,
    class_dict.major_class_type AS major_class,
    class_dict.modeling_group,
    CASE WHEN class_dict.major_class_code = '2' THEN 'RES' ELSE 'OTHER' END
        AS res_other,
    uni.year
FROM uni
LEFT JOIN {{ ref('reporting.vw_pin_value_long') }} AS vals
    ON uni.pin = vals.pin
    AND uni.year = vals.year
    AND uni.stage_name = vals.stage_name
LEFT JOIN {{ ref('ccao.class_dict') }}
    ON uni.class = class_dict.class_code
```
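The CROSS JOIN / LEFT JOIN pattern in this model (every parcel gets a row for every stage, then values are joined on where they exist) can be sketched outside SQL as well. A toy pandas equivalent, where all table contents and column values are hypothetical stand-ins for the real views:

```python
import pandas as pd

# Toy stand-ins for vw_pin_universe, the stages CTE, and vw_pin_value_long.
parcels = pd.DataFrame({"pin": ["001", "002"], "class": ["2", "4"]})
stages = pd.DataFrame(
    {"stage_name": ["MAILED", "ASSESSOR CERTIFIED", "BOR CERTIFIED"]}
)
vals = pd.DataFrame(
    {"pin": ["001"], "stage_name": ["MAILED"], "av_tot": [25_000]}
)

# CROSS JOIN: 2 parcels x 3 stages = 6 rows, one per parcel/stage combination.
uni = parcels.merge(stages, how="cross")

# LEFT JOIN: stages not yet reached keep their row, with av_tot left as NaN.
out = uni.merge(vals, on=["pin", "stage_name"], how="left")
print(out)
```

This is why downstream aggregates can report a (mostly empty) row for every stage even before iasworld.asmt_all contains values for that stage.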