Refactor agency crosswalk and add agency fund crosswalk#89
Conversation
…crosswalk-data-model-to-better-reflect-underlying-data-change
… crosswalks in `agency.R`
…und_crosswalk` in agencies vignette
| # Write both data sets to S3 | ||
| arrow::write_parquet( | ||
| x = agency %>% select(-agency_name), | ||
| sink = remote_path_agency, | ||
| compression = "zstd" | ||
| ) | ||
| arrow::write_parquet( | ||
| x = agency_info, | ||
| sink = remote_path_agency_info, | ||
| compression = "zstd" | ||
| ) |
There was a problem hiding this comment.
These lines are not new, the diff is just confusing because some of the agency crosswalk code is extracted from the old agency_info code and git is confused about it.
| mutate( | ||
| fund_num_final = case_when( | ||
| # Levy adjustments (408) have the same fund numbers across all years, so | ||
| # handle them separately | ||
| fund_type_num == "408" ~ fund_num, | ||
| minor_type == "LIBRARY" ~ paste0(fund_type_num, "001"), | ||
| minor_type == "GEN ASST" ~ paste0(fund_type_num, "002"), | ||
| minor_type == "INFRA" ~ paste0(fund_type_num, "003"), | ||
| minor_type == "HEALTH" & | ||
| str_detect(agency_name, "MENTAL") ~ paste0(fund_type_num, "004"), | ||
| minor_type == "HEALTH" & | ||
| str_detect(agency_name, "PUBLIC") ~ paste0(fund_type_num, "005") | ||
| ) |
There was a problem hiding this comment.
These rules are not documented as far as I know, but I determined them through trial and error. My QC of all of the resulting fund numbers confirmed that they are correct.
There was a problem hiding this comment.
This makes sense and is a smart workaround!
| # Set the database version. This gets incremented manually whenever the database | ||
| # changes. This is checked against Config/Requires_DB_Version in the DESCRIPTION | ||
| # file via check_db_version(). Schema is: | ||
| # "MAX_YEAR_OF_DATA.MAJOR_VERSION.MINOR_VERSION-PRE_RELEASE_VERSION" |
There was a problem hiding this comment.
I only just realized that this line is documenting the format for db_version, but db_pre_release_version is separate. I moved the pre-release component of this documentation to a separate comment on db_pre_release_version.
| # Starting in 2024, the data source for the `pin` table has changed. | ||
| # The current package maintainers do not have access to the old data source, | ||
| # which was a snapshot mirror of a mainframe system that has been | ||
| # decommissioned. To facilitate future updates, we copied over pre-2024 | ||
| # `pin` files without edits. These legacy files are missing some columns | ||
| # that we added in 2024, so we need to unify the schemas across files, since | ||
| # otherwise arrow will take the schema from the first file it finds in the | ||
| # dataset. In the future, we could also consider editing the old files to | ||
| # add empty values for the new columns |
There was a problem hiding this comment.
The prior iteration of this comment was slightly misleading, so I updated it for accuracy and clarity.
| agency_num_final varchar(9) NOT NULL, | ||
| PRIMARY KEY (agency_num) | ||
| ) WITHOUT ROWID; | ||
|
|
There was a problem hiding this comment.
No indexes on either of these crosswalks because the primary keys add automatic indexes. The tables are also pretty small so the indexes shouldn't provide much performance improvement anyway.
| Version: 1.1.0 | ||
| Authors@R: c( | ||
| person(given = "Kyra", family = "Sturgill", email = "kyra.sturgill@cookcountyil.gov", role = c("aut", "cre")), | ||
| person(given = "Kyra", family = "Sturgill", email = "Assessor.Data@cookcountyil.gov", role = c("aut", "cre")), |
There was a problem hiding this comment.
Changing this from your personal email to the shared team email for the sake of privacy and durability. I don't feel strongly about it though, if you'd rather have your personal email listed we can keep it that way.
| | tif_crosswalk | Clerk | Manually created from TIF summary and distribution reports | [data-raw/tif/tif.R](data-raw/tif/tif.R) | Fix for data issue identified in #39 | | ||
| | tif_distribution | Clerk | [TIF Reports - Tax Increment Agency Distribution Reports](https://www.cookcountyclerkil.gov/property-taxes/tifs-tax-increment-financing/tif-reports) | [data-raw/tif/tif.R](data-raw/tif/tif.R) | TIF EAV, frozen EAV, and distribution percentage by tax code | | ||
| | pin_tif_distribution | Clerk | [TIF Reports - Tax Increment Agency Distribution Reports](https://www.cookcountyclerkil.gov/property-taxes/tifs-tax-increment-financing/tif-reports) | [data-raw/tif/tif.R](data-raw/tif/tif.R) | TIF EAV, frozen EAV, and distribution percentage by PIN | | ||
| | Table Name | Source Agency | Source Link | Ingest Script | Contains | |
There was a problem hiding this comment.
The main diff here is adding rows for agency_crosswalk and agency_fund_crosswalk. The rest of it is just whitespace changes to reformat the table so that it fits agency_fund_crosswalk, which is now the longest table name in the table.
| In 2024, the Clerk switched to reporting 78 agencies as funds underneath a separate agency. These agencies had always represented funds in the real world, but the Clerk reported them as independent taxing agencies prior to 2024. We need to account for this change when analyzing agencies and funds over time. | ||
|
|
||
| The following types of funds were affected by this change: | ||
|
|
||
| - Library funds | ||
| - General assistance funds | ||
| - Infrastructure funds (road and bridge) | ||
| - Mental health and public health funds | ||
| - Levy adjustments | ||
|
|
||
| Most tax codes contain at least one of these types of agencies. |
There was a problem hiding this comment.
I added some background here to give the reader context regarding the 2024 change.
| @@ -47,74 +47,126 @@ ptaxsim_db_conn <- DBI::dbConnect( | |||
|
|
|||
| ## Accounting for 2024 changes to agency fund reporting | |||
There was a problem hiding this comment.
I took a second pass at this doc and made a number of changes to the code/prose in addition to swapping out the crosswalk format. I hope that's OK! Down to make further edits if you disagree with any of my choices here.
| ``` | ||
|
|
||
| ## Agency fund data updates and query demo | ||
| ## Tracking specific fund revenue over time |
There was a problem hiding this comment.
This section header feels clearer to me, though I'm open to pushback.
kyrasturgill
left a comment
There was a problem hiding this comment.
This is really awesome and I'm so impressed you incorporated all of the new data schema and vignette updates so quickly. I mainly just have a few clarifying questions and corrections of some typos I had made in the vignette.
Let me know if there's anything you want to discuss further on Monday!
|
|
||
| # agency_fund_crosswalk -------------------------------------------------------- | ||
|
|
||
| changed_funds <- agency_fund_info %>% |
There was a problem hiding this comment.
[question, non-blocking] Just checking my understanding is correct - changed_funds technically refers to the funds that previously were under the agencies that starting in 2024 have become funds? Not funds that are "new" (i.e. don't end in "000")?
| fund_num_final = case_when( | ||
| # Levy adjustments (408) have the same fund numbers across all years, so | ||
| # handle them separately | ||
| fund_type_num == "408" ~ fund_num, |
There was a problem hiding this comment.
[question, non-blocking]: Technically there's nothing "new" about this fund right? The purpose of this being in the crosswalk is for a user who may want to track specifically total recapture pre- and post- 2024 for one of the agencies that had one of these changed funds?
| mutate( | ||
| fund_num_final = case_when( | ||
| # Levy adjustments (408) have the same fund numbers across all years, so | ||
| # handle them separately | ||
| fund_type_num == "408" ~ fund_num, | ||
| minor_type == "LIBRARY" ~ paste0(fund_type_num, "001"), | ||
| minor_type == "GEN ASST" ~ paste0(fund_type_num, "002"), | ||
| minor_type == "INFRA" ~ paste0(fund_type_num, "003"), | ||
| minor_type == "HEALTH" & | ||
| str_detect(agency_name, "MENTAL") ~ paste0(fund_type_num, "004"), | ||
| minor_type == "HEALTH" & | ||
| str_detect(agency_name, "PUBLIC") ~ paste0(fund_type_num, "005") | ||
| ) |
There was a problem hiding this comment.
This makes sense and is a smart workaround!
| # Starting in 2024, the data source for the `pin` table has changed. | ||
| # The current package maintainers do not have access to the old data source, | ||
| # which was a snapshot mirror of a mainframe system that has been | ||
| # decommissioned. To facilitate future updates, we copied over pre-2024 | ||
| # `pin` files without edits. These legacy files are missing some columns | ||
| # that we added in 2024, so we need to unify the schemas across files, since | ||
| # otherwise arrow will take the schema from the first file it finds in the | ||
| # dataset. In the future, we could also consider editing the old files to | ||
| # add empty values for the new columns |
| } | ||
|
|
||
| agency_fund_crosswalk <- changed_funds %>% | ||
| mutate(year = "2024") %>% |
There was a problem hiding this comment.
[question]: Is the intention here to include for year to denote for which tax year the agency_num change was introduced? I just am wondering if that could cause some friction when joining the crosswalk because it would not be useful/correct to join by agency_num and year. Does that make sense?
There was a problem hiding this comment.
I now see your explanation of what year means in the vignette which pretty perfectly answers this question! I can see why it makes sense to keep it.
| Version: 1.1.0 | ||
| Authors@R: c( | ||
| person(given = "Kyra", family = "Sturgill", email = "kyra.sturgill@cookcountyil.gov", role = c("aut", "cre")), | ||
| person(given = "Kyra", family = "Sturgill", email = "Assessor.Data@cookcountyil.gov", role = c("aut", "cre")), |
| chi_library_pension_fund_plot | ||
| ``` | ||
|
|
||
| ## Conclusion |
There was a problem hiding this comment.
I think this should be a section header to be consistent with introduction.
| ## Conclusion | |
| # Conclusion |
| chi_library_pension_fund_plot <- chi_library_pension_fund %>% | ||
| ggplot(aes(x = as.integer(year), y = final_levy)) + | ||
| geom_line(linewidth = 0.5, color = "black") + | ||
| geom_point(color = "black") + | ||
| scale_x_continuous(n.breaks = 10) + | ||
| scale_y_continuous( | ||
| labels = scales::label_dollar(scale = 1e-6, suffix = "M"), | ||
| limits = c(5e6, NA) |
There was a problem hiding this comment.
[suggestion, non-blocking]: this is a simpler chart so not really necessary but for consistency we could put similar drop down for plot code - I'm open to pushback though!
| we have updated to include a TIF counterfactual with data for tax year | ||
| 2024. |
There was a problem hiding this comment.
Just noticing in the rendered changelog there seems an unnecessary line break at tax year 2024
| @@ -267,7 +269,7 @@ The PTAXSIM backend database contains cleaned data from the Cook County Clerk, T | |||
| ## Notes and caveats | |||
There was a problem hiding this comment.
I know this is beyond the scope of this PR, but what are your thoughts on adding a note here about how PINs with an EAV less than $150 will have a $0 bill but our tax_bill() function does not have that behavior built in?
"For PINs that have a final taxable EAV less than $150, the Cook County Treasurer will by default set the PIN's tax bill total to $0; note that we do not incorporate this behavior in the tax_bill() function which will lead our calculations to not match the final bill total."
Or something along those lines.
This PR refactors the agency crosswalk data model to move crosswalk columns out of the
agency_infotable and into two dedicated tablesagency_crosswalkandagency_fund_crosswalk. There are two main advantages to this change:yearcolumn recording the year of the change, which will allow us to easily update the crosswalk in the future if agency or fund numbers change againagency_fund_crosswalktable allows for resolving individual funds over time across the change in the same way that the agency crosswalk allows for resolving agencies over timeCloses #88. See that issue for more details about this design and the reasons why we chose it.
Testing and QC
Agency crosswalk
I tested to confirm that the contents of the new agency crosswalk exactly match the contents of the existing agency crosswalk:
Agency fund crosswalk
QCing the agency fund crosswalk was harder because no source data exists for it and there isn't an existing version we can compare it to. I ended up double-checking each of the new fund numbers against the 2024 agency report Detail tab to confirm that they looked correct.
This surfaced one issue:
Otherwise, the fund numbering logic appears correct to me based on my QC.
Vignettes
I ran the vignettes locally to confirm that they render as expected. (Doing so requires downloading the new pre-release database version.)