Refactor agency crosswalk and add agency fund crosswalk #89

Open: jeancochrane wants to merge 12 commits into 2024-data-update from jeancochrane/88-refactor-agency-crosswalk-data-model-to-better-reflect-underlying-data-change

@jeancochrane (Member) commented Apr 29, 2026

This PR refactors the agency crosswalk data model to move crosswalk columns out of the agency_info table and into two dedicated tables, agency_crosswalk and agency_fund_crosswalk. There are two main advantages to this change:

  1. Both tables include a year column recording the year of the change, which will allow us to easily update the crosswalk in the future if agency or fund numbers change again.
  2. The agency_fund_crosswalk table lets us resolve individual funds across the change over time, in the same way that the agency crosswalk lets us resolve agencies.

Closes #88. See that issue for more details about this design and the reasons why we chose it.
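
As a rough illustration of how the new agency_crosswalk table might be used, here is a minimal sketch with made-up data. The agency numbers, the levies table, and its columns are hypothetical; only the agency_num, agency_num_final, and year column names come from this PR.

```r
library(dplyr)

# Hypothetical crosswalk rows; real agency numbers and column types may differ
agency_crosswalk <- tibble::tibble(
  agency_num = c("111111111", "222222222"),
  agency_num_final = c("999999999", "999999999"),
  year = c("2024", "2024")
)

# Hypothetical levies reported under pre- and post-change agency numbers
levies <- tibble::tibble(
  year = c("2023", "2023", "2023", "2024"),
  agency_num = c("111111111", "222222222", "999999999", "999999999"),
  levy = c(100, 50, 1000, 1200)
)

resolved <- levies %>%
  # Join on agency_num only: the crosswalk's year column records when the
  # change happened, not which years of data the mapping applies to
  left_join(
    agency_crosswalk %>% select(agency_num, agency_num_final),
    by = "agency_num"
  ) %>%
  # Agencies that are absent from the crosswalk keep their own number
  mutate(agency_num_final = coalesce(agency_num_final, agency_num)) %>%
  group_by(year, agency_num_final) %>%
  summarize(levy = sum(levy), .groups = "drop")

print(resolved)
```

Dropping the crosswalk's year column before the join keeps it from colliding with the year column in the data being resolved.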

Testing and QC

Agency crosswalk

I tested to confirm that the contents of the new agency crosswalk exactly match the contents of the existing agency crosswalk:

library(dplyr)

# db_conn_old and db_conn_new are DBI connections to the old and new databases
agency_cw_old <- DBI::dbGetQuery(
  db_conn_old,
  "select agency_num, agency_num_24 AS agency_num_final
  from agency_info where agency_change_24"
) %>%
  rename_with(~ paste0(., "_old"))

agency_cw_new <- DBI::dbGetQuery(
  db_conn_new,
  "select agency_num, agency_num_final
  from agency_crosswalk"
) %>%
  rename_with(~ paste0(., "_new"))

agency_cw_new %>%
  full_join(
    agency_cw_old,
    by = c("agency_num_new" = "agency_num_old"),
    keep = TRUE
  ) %>%
  mutate(
    chk_agency_num_in_old_not_new = is.na(agency_num_new),
    chk_agency_num_in_new_not_old = is.na(agency_num_old),
    # Rows are joined on agency_num, so compare the final numbers here
    chk_agency_num_final_mismatch = (
      !is.na(agency_num_new) &
        !is.na(agency_num_old) &
        agency_num_final_new != agency_num_final_old
    )
  ) %>%
  filter(if_any(starts_with("chk_"), ~ .x))

Agency fund crosswalk

QCing the agency fund crosswalk was harder because no source data exists for it and there isn't an existing version we can compare it to. I ended up double-checking each of the new fund numbers against the 2024 agency report Detail tab to confirm that they looked correct.

This surfaced one issue:

  • A few funds exist in theory but not in practice -- that is to say, they had an entry in 2023, and based on our fund numbering logic they should exist in 2024, but they do not. Based on my investigation, I believe these are all funds that had $0 levies in 2023 and so got removed from the 2024 report, but their theoretical fund numbers are still correct.
    • These funds include:
      • 206001 (library building notes -- only existed in Elmwood Park pre-2024, but had a $0 levy)
      • 003001 (library bonds -- missing for many but not all agencies in the crosswalk)
      • 003002 (general assistance bonds -- only existed for Orland in 2023, but had a $0 levy)
      • 003005 (public health bonds -- only existed for Stickney in 2023)
      • 027001 (library purchase agreement -- only existed for Markham in 2023 with $0 levy)

Otherwise, the fund numbering logic appears correct to me based on my QC.
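
For future updates, part of this manual check could perhaps be automated with an anti-join. The sketch below uses made-up data: the funds_2024 table, its columns, the agency identifiers, and the agency_num_final column on the fund crosswalk are all assumptions for illustration, not the actual schema.

```r
library(dplyr)

# Hypothetical crosswalk output: fund numbers we expect to exist in 2024
agency_fund_crosswalk <- tibble::tibble(
  agency_num_final = c("A", "A", "B"),
  fund_num_final = c("003001", "206001", "003002")
)

# Hypothetical fund numbers transcribed from the 2024 agency report
funds_2024 <- tibble::tibble(
  agency_num = "A",
  fund_num = "003001"
)

# Funds that exist "in theory" (in the crosswalk) but not in the 2024 report
missing_in_2024 <- agency_fund_crosswalk %>%
  anti_join(
    funds_2024,
    by = c("agency_num_final" = "agency_num", "fund_num_final" = "fund_num")
  )

print(missing_in_2024)
```

Any rows returned would then need the same manual review as above, for example checking whether the fund had a $0 levy in 2023.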

Vignettes

I ran the vignettes locally to confirm that they render as expected. (Doing so requires downloading the new pre-release database version.)

Comment thread data-raw/agency/agency.R
Comment on lines +585 to +595
# Write both data sets to S3
arrow::write_parquet(
  x = agency %>% select(-agency_name),
  sink = remote_path_agency,
  compression = "zstd"
)
arrow::write_parquet(
  x = agency_info,
  sink = remote_path_agency_info,
  compression = "zstd"
)

@jeancochrane (Member, Author) commented:

These lines are not new; the diff is just confusing because some of the agency crosswalk code was extracted from the old agency_info code and git is confused about it.

Comment thread data-raw/agency/agency.R
Comment on lines +625 to +637
mutate(
  fund_num_final = case_when(
    # Levy adjustments (408) have the same fund numbers across all years, so
    # handle them separately
    fund_type_num == "408" ~ fund_num,
    minor_type == "LIBRARY" ~ paste0(fund_type_num, "001"),
    minor_type == "GEN ASST" ~ paste0(fund_type_num, "002"),
    minor_type == "INFRA" ~ paste0(fund_type_num, "003"),
    minor_type == "HEALTH" &
      str_detect(agency_name, "MENTAL") ~ paste0(fund_type_num, "004"),
    minor_type == "HEALTH" &
      str_detect(agency_name, "PUBLIC") ~ paste0(fund_type_num, "005")
  )

@jeancochrane (Member, Author) commented:

These rules are not documented as far as I know, but I determined them through trial and error. My QC of all of the resulting fund numbers confirmed that they are correct.

@kyrasturgill (Member) commented:

This makes sense and is a smart workaround!
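
To make the numbering rules in this thread concrete, here is a small sketch that applies the same case_when() logic to made-up rows. The agency names, the fund_type_num assignments, and the pre-2024 fund_num values are hypothetical; only the rules themselves come from the snippet above.

```r
library(dplyr)
library(stringr)

# Hypothetical input rows; real names and numbers may differ
funds <- tibble::tibble(
  agency_name = c(
    "EXAMPLE LIBRARY FUND", "EXAMPLE GENERAL ASSISTANCE",
    "EXAMPLE ROAD AND BRIDGE", "EXAMPLE MENTAL HEALTH",
    "EXAMPLE PUBLIC HEALTH", "EXAMPLE LIBRARY FUND"
  ),
  minor_type = c("LIBRARY", "GEN ASST", "INFRA", "HEALTH", "HEALTH", "LIBRARY"),
  fund_type_num = c("206", "003", "003", "003", "003", "408"),
  fund_num = c("206000", "003000", "003000", "003000", "003000", "408000")
)

funds_final <- funds %>%
  mutate(
    fund_num_final = case_when(
      # Levy adjustments keep their existing fund number; this branch comes
      # first so it wins over the minor_type rules below
      fund_type_num == "408" ~ fund_num,
      minor_type == "LIBRARY" ~ paste0(fund_type_num, "001"),
      minor_type == "GEN ASST" ~ paste0(fund_type_num, "002"),
      minor_type == "INFRA" ~ paste0(fund_type_num, "003"),
      minor_type == "HEALTH" &
        str_detect(agency_name, "MENTAL") ~ paste0(fund_type_num, "004"),
      minor_type == "HEALTH" &
        str_detect(agency_name, "PUBLIC") ~ paste0(fund_type_num, "005")
    )
  )

print(funds_final$fund_num_final)
```

Because there is no default branch, any row that matches none of the rules gets NA, which makes unexpected fund types easy to spot during QC.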

Comment thread data-raw/create_db.R
# Set the database version. This gets incremented manually whenever the database
# changes. This is checked against Config/Requires_DB_Version in the DESCRIPTION
# file via check_db_version(). Schema is:
# "MAX_YEAR_OF_DATA.MAJOR_VERSION.MINOR_VERSION-PRE_RELEASE_VERSION"

@jeancochrane (Member, Author) commented:

I only just realized that this line is documenting the format for db_version, but db_pre_release_version is separate. I moved the pre-release component of this documentation to a separate comment on db_pre_release_version.

Comment thread data-raw/create_db.R
Comment on lines +144 to +152
# Starting in 2024, the data source for the `pin` table has changed.
# The current package maintainers do not have access to the old data source,
# which was a snapshot mirror of a mainframe system that has been
# decommissioned. To facilitate future updates, we copied over pre-2024
# `pin` files without edits. These legacy files are missing some columns
# that we added in 2024, so we need to unify the schemas across files, since
# otherwise arrow will take the schema from the first file it finds in the
# dataset. In the future, we could also consider editing the old files to
# add empty values for the new columns

@jeancochrane (Member, Author) commented:

The prior iteration of this comment was slightly misleading, so I updated it for accuracy and clarity.

@kyrasturgill (Member) commented:

Helpful!
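
For readers unfamiliar with the schema issue described in that code comment, here is a minimal sketch of one way to unify schemas with arrow, using the unify_schemas argument of open_dataset(). The file names and the new_col column are hypothetical, and this is not necessarily how create_db.R handles it.

```r
library(arrow)
library(dplyr)

dataset_dir <- tempfile("pin_demo_")
dir.create(dataset_dir)

# Hypothetical legacy file that lacks the newer column
write_parquet(
  data.frame(year = 2023L, pin = "01"),
  file.path(dataset_dir, "part-2023.parquet")
)
# Hypothetical newer file that includes an extra column
write_parquet(
  data.frame(year = 2024L, pin = "01", new_col = 1.5),
  file.path(dataset_dir, "part-2024.parquet")
)

# With unify_schemas = TRUE, arrow inspects every file rather than only the
# first one; rows from files that lack a column get missing values for it
pin <- open_dataset(dataset_dir, unify_schemas = TRUE) %>%
  collect()

print(pin)
```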

Comment thread data-raw/create_db.sql
agency_num_final varchar(9) NOT NULL,
PRIMARY KEY (agency_num)
) WITHOUT ROWID;

@jeancochrane (Member, Author) commented:

No indexes on either of these crosswalks because the primary keys add automatic indexes. The tables are also pretty small so the indexes shouldn't provide much performance improvement anyway.
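
One way to sanity-check this is to ask SQLite for its query plan on a primary key lookup. The sketch below uses an in-memory database; the year and agency_num column definitions are assumptions, since only the agency_num_final definition and the primary key appear in the diff above.

```r
library(DBI)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Column definitions other than agency_num_final are assumed for illustration
dbExecute(conn, "
  CREATE TABLE agency_crosswalk (
    year int NOT NULL,
    agency_num varchar(9) NOT NULL,
    agency_num_final varchar(9) NOT NULL,
    PRIMARY KEY (agency_num)
  ) WITHOUT ROWID
")

# Ask SQLite how it would execute a lookup by primary key
plan <- dbGetQuery(conn, "
  EXPLAIN QUERY PLAN
  SELECT agency_num_final FROM agency_crosswalk WHERE agency_num = '000000000'
")
print(plan$detail)

dbDisconnect(conn)
```

The plan detail should describe a search using the primary key rather than a full table scan, which is consistent with not needing a separate index.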

Comment thread DESCRIPTION
Version: 1.1.0
Authors@R: c(
-    person(given = "Kyra", family = "Sturgill", email = "kyra.sturgill@cookcountyil.gov", role = c("aut", "cre")),
+    person(given = "Kyra", family = "Sturgill", email = "Assessor.Data@cookcountyil.gov", role = c("aut", "cre")),

@jeancochrane (Member, Author) commented:

Changing this from your personal email to the shared team email for the sake of privacy and durability. I don't feel strongly about it, though; if you'd rather have your personal email listed, we can keep it that way.

@kyrasturgill (Member) commented:

Good idea!

Comment thread README.Rmd
| tif_crosswalk | Clerk | Manually created from TIF summary and distribution reports | [data-raw/tif/tif.R](data-raw/tif/tif.R) | Fix for data issue identified in #39 |
| tif_distribution | Clerk | [TIF Reports - Tax Increment Agency Distribution Reports](https://www.cookcountyclerkil.gov/property-taxes/tifs-tax-increment-financing/tif-reports) | [data-raw/tif/tif.R](data-raw/tif/tif.R) | TIF EAV, frozen EAV, and distribution percentage by tax code |
| pin_tif_distribution | Clerk | [TIF Reports - Tax Increment Agency Distribution Reports](https://www.cookcountyclerkil.gov/property-taxes/tifs-tax-increment-financing/tif-reports) | [data-raw/tif/tif.R](data-raw/tif/tif.R) | TIF EAV, frozen EAV, and distribution percentage by PIN |
| Table Name | Source Agency | Source Link | Ingest Script | Contains |

@jeancochrane (Member, Author) commented:

The main diff here is adding rows for agency_crosswalk and agency_fund_crosswalk. The rest of it is just whitespace changes to reformat the table so that it fits agency_fund_crosswalk, which is now the longest table name in the table.

Comment thread vignettes/agencies.Rmd
Comment on lines +50 to +60
In 2024, the Clerk switched to reporting 78 agencies as funds underneath a separate agency. These agencies had always represented funds in the real world, but the Clerk reported them as independent taxing agencies prior to 2024. We need to account for this change when analyzing agencies and funds over time.

The following types of funds were affected by this change:

- Library funds
- General assistance funds
- Infrastructure funds (road and bridge)
- Mental health and public health funds
- Levy adjustments

Most tax codes contain at least one of these types of agencies.

@jeancochrane (Member, Author) commented:

I added some background here to give the reader context regarding the 2024 change.

Comment thread vignettes/agencies.Rmd
@@ -47,74 +47,126 @@ ptaxsim_db_conn <- DBI::dbConnect(

## Accounting for 2024 changes to agency fund reporting

@jeancochrane (Member, Author) commented:

I took a second pass at this doc and made a number of changes to the code/prose in addition to swapping out the crosswalk format. I hope that's OK! Down to make further edits if you disagree with any of my choices here.

Comment thread vignettes/agencies.Rmd
```

- ## Agency fund data updates and query demo
+ ## Tracking specific fund revenue over time

@jeancochrane (Member, Author) commented:

This section header feels clearer to me, though I'm open to pushback.

@kyrasturgill (Member) commented:

Makes sense!

@jeancochrane jeancochrane marked this pull request as ready for review April 30, 2026 20:44

@kyrasturgill (Member) left a comment:

This is really awesome and I'm so impressed you incorporated all of the new data schema and vignette updates so quickly. I mainly just have a few clarifying questions and corrections of some typos I had made in the vignette.
Let me know if there's anything you want to discuss further on Monday!

Comment thread data-raw/agency/agency.R

# agency_fund_crosswalk --------------------------------------------------------

changed_funds <- agency_fund_info %>%

@kyrasturgill (Member) commented:

[question, non-blocking] Just checking that my understanding is correct: changed_funds technically refers to the funds that were previously under the agencies that became funds starting in 2024? Not funds that are "new" (i.e. that don't end in "000")?

Comment thread data-raw/agency/agency.R
  fund_num_final = case_when(
    # Levy adjustments (408) have the same fund numbers across all years, so
    # handle them separately
    fund_type_num == "408" ~ fund_num,

@kyrasturgill (Member) commented:

[question, non-blocking]: Technically there's nothing "new" about this fund, right? The purpose of including it in the crosswalk is to support a user who may want to track total recapture pre- and post-2024 for one of the agencies that had one of these changed funds?

Comment thread data-raw/agency/agency.R
}

agency_fund_crosswalk <- changed_funds %>%
  mutate(year = "2024") %>%

@kyrasturgill (Member) commented:

[question]: Is the intention here for year to denote the tax year in which the agency_num change was introduced? I'm just wondering if that could cause some friction when joining the crosswalk, because it would not be useful/correct to join by agency_num and year. Does that make sense?

@kyrasturgill (Member) commented:

I now see your explanation of what year means in the vignette, which pretty perfectly answers this question! I can see why it makes sense to keep it.

Comment thread vignettes/agencies.Rmd
chi_library_pension_fund_plot
```

## Conclusion

@kyrasturgill (Member) commented:

I think this should be a section header to be consistent with introduction.

Suggested change
- ## Conclusion
+ # Conclusion

Comment thread vignettes/agencies.Rmd
Comment on lines +605 to +612
chi_library_pension_fund_plot <- chi_library_pension_fund %>%
  ggplot(aes(x = as.integer(year), y = final_levy)) +
  geom_line(linewidth = 0.5, color = "black") +
  geom_point(color = "black") +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(
    labels = scales::label_dollar(scale = 1e-6, suffix = "M"),
    limits = c(5e6, NA)

@kyrasturgill (Member) commented:

[suggestion, non-blocking]: This is a simpler chart, so it's not really necessary, but for consistency we could put the plot code in a similar dropdown. I'm open to pushback though!

Comment thread NEWS.md
Comment on lines 101 to 102
we have updated to include a TIF counterfactual with data for tax year
2024.

@kyrasturgill (Member) commented May 2, 2026:

I just noticed that in the rendered changelog there seems to be an unnecessary line break at "tax year 2024".

Comment thread README.Rmd
@@ -267,7 +269,7 @@ The PTAXSIM backend database contains cleaned data from the Cook County Clerk, T
## Notes and caveats

@kyrasturgill (Member) commented:

I know this is beyond the scope of this PR, but what are your thoughts on adding a note here about how PINs with an EAV less than $150 will have a $0 bill but our tax_bill() function does not have that behavior built in?

"For PINs that have a final taxable EAV less than $150, the Cook County Treasurer will by default set the PIN's tax bill total to $0; note that we do not incorporate this behavior in the tax_bill() function which will lead our calculations to not match the final bill total."

Or something along those lines.
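
If users wanted to approximate that behavior themselves, a post-processing step along these lines might work. This is only a sketch under stated assumptions: the bills data frame stands in for tax_bill() output, and the pin, eav, and final_tax column names are assumed here rather than confirmed in this thread.

```r
library(dplyr)

# Hypothetical stand-in for tax_bill() output: one row per PIN and agency
bills <- tibble::tibble(
  year = 2024L,
  pin = c("A", "A", "B"),
  eav = c(100, 100, 20000),
  final_tax = c(4.10, 2.25, 800)
)

# Threshold quoted in the comment above
min_taxable_eav <- 150

# Zero out every bill line for PINs whose taxable EAV is below the threshold
bills_adjusted <- bills %>%
  mutate(final_tax = if_else(eav < min_taxable_eav, 0, final_tax))

bills_adjusted %>%
  group_by(year, pin) %>%
  summarize(total_bill = sum(final_tax), .groups = "drop") %>%
  print()
```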

Development

Successfully merging this pull request may close these issues.

Refactor agency crosswalk data model to better reflect underlying data change

2 participants