Background and problem statement
Now that I have finally internalized the 2024 change to agencies that has inspired the agency crosswalk (#65) and its associated vignette (#84) -- thanks in large part to @kyrasturgill's patience and wisdom -- I have some thoughts regarding how we might refactor the crosswalk data model to make it more intuitive to use. We've discussed some of these ideas in person and in Teams chats, but I want to pull the discussion out into an issue so that we can preserve it for posterity.
As a reminder, here is how we are currently implementing the "agency crosswalk" in 2024:
- New columns on the
agency_info table
agency_info.agency_num_24 (optional string): New agency number in 2024, or null if the agency number has not changed
agency_info.agency_name_24 (optional string): New agency name in 2024, or null if the agency name has not changed
agency_info.agency_change_24 (required boolean): True if the agency changed in 2024, and false otherwise
To me, the issue is that this data model is framed around the notion that the Clerk has changed some agency numbers in 2024 -- hence column names like agency_num_24 (in contrast to agency_num) and agency_change_24, which lead the user to understand the change to be a change of agency number. However, based on recent conversations like #85 (review), it's become apparent to me that it would be more accurate to frame the issue by saying that the Clerk has changed some agencies to funds in 2024: The Clerk reported some funds (like the City of Chicago library funds) as independent agencies prior to 2024, but now they report these funds as funds underneath a parent agency (in this case, the City of Chicago). The old independent agencies still exist in the 2024 agency report, but they all have $0 levies.
If this updated framing is indeed more accurate, then I think the current way we structure the "agency crosswalk" is slightly misleading, since it elides the substance of the change, implying that the crux of the change is that some agencies have new numbers, when in reality those agencies have become subsumed as funds under different agencies.
You could argue that this is not a particularly important concern, since we could avoid misunderstandings with more documentation. However, I think the current design also has a couple of concrete disadvantages:
- Disadvantage 1: The current data model ignores 2024 changes to fund numbers. In the process of moving some funds from an independent agency to a parent agency, the clerk has also changed some fund numbers. For example, here is the Note Redemption fund for CPL in 2023:
| year |
agency_num |
fund_num |
agency_name |
fund_name |
levy |
| 2023 |
030210001 |
259000 |
CITY OF CHICAGO LIBRARY FUND |
LIBRARY NOTE REDEMPTION |
114582000 |
Here is the same fund in 2024:
| year |
agency_num |
fund_num |
agency_name |
fund_name |
levy |
| 2024 |
030210000 |
259001 |
CITY OF CHICAGO |
NOTE REDEMPTION LIBRARY FUND |
117145000 |
Note that the both the agency number and fund number have changed. If the user doesn't care about specific funds, this is probably fine, since they will just roll all of the funds up to the agency level anyway. But if the user wants to track specific funds -- say, just CPL funds -- they will have to do some digging to figure out the new fund numbers and hardcode those fund numbers into their analysis.
- Disadvantage 2: The current data model will not be adaptable if these types of changes occur again in the future. The fields are currently tightly coupled to the expectation that this is a one-time change in 2024. If a similar change were to take place next year, we would need a new set of fields like
agency_info.agency_change_25. It is our expectation (and hope) that this kind of change won't happen again, but if it does, our data model won't handle it well.
Proposed solutions
Here are a few options for how I think we might improve the agency crosswalk data model to resolve the issues I raised above.
Option 1: Dedicated crosswalk tables
This option splits out the agency_*_24 columns into dedicated tables, adds a more flexible year column, and adds support for converting fund numbers in addition to agency numbers. Read on for descriptions of the proposed tables.
Agency crosswalk
- New table
agency_crosswalk that would replace the agency_info.agency_*_24 fields that we currently maintain
- Fields contained in this table:
year (required integer): Year that agency_num changed
agency_num (required string): The old agency number, prior to the change
agency_num_final (required string): The new version of the agency number following the change
- Table would have one row for each agency that has changed at some point in time, with
year set to the year that the agency changed
Here's an example of the rows that would exist in agency_crosswalk to support the change to the CPL fund:
| year |
agency_num |
agency_num_final |
| 2024 |
030210001 |
030210000 |
Here's how a user could use this table:
agency_cw = DBI::dbGetQuery(ptaxsim_db_conn, "SELECT * FROM agency_crosswalk")
agencies %>%
left_join(agency_cw, by = c("agency_num")) %>%
mutate(agency_num = coalesce(agency_num_final, agency_num))
Note that the user doesn't actually need to use the year column, because we are assuming that agency/fund transitions are one-way. If that assumption ever gets violated in the future, we would need to reconsider our approach, because even including the year column in the join wouldn't be enough to resolve the potential ambiguity. (More on that in the pros/cons section below.)
Fund crosswalk
- New table
agency_fund_crosswalk
year (required integer): Year that that agency_num and/or fund_num changed
agency_num (required string): The old agency number, prior to the change
agency_num_final (required string): The new agency number, after the change
fund_num (required string): The old fund number, prior to the change
fund_num_final (required string): The new fund number, after the change
Here's what agency_fund_crosswalk would like like for the CPL Note Redemption fund:
| year |
agency_num |
agency_num_final |
fund_num |
fund_num_final |
| 2024 |
030210001 |
030210000 |
259000 |
259001 |
Here's how a user could use this table:
agency_fund_cw = DBI::dbGetQuery(ptaxsim_db_conn, "SELECT * FROM agency_fund_crosswalk")
agency_funds %>%
left_join(agency_fund_cw, by = c("agency_num", "fund_num") ) %>%
mutate(
agency_num = coalesce(agency_num_final, agency_num),
fund_num = coalesce(fund_num_final, fund_num)
)
Pros and cons of this approach
- Pros of this approach:
- Meets three requirements motivating this issue:
- More closely matches the underlying data change
- Supports tracking changing funds over time using fund numbers
- Easy for our team to update over time, since it's just keyed by year and agency number
- Very similar to our current design, but adds easy support for new years in the future
- Easier for end users to understand and use than our current approach
- Cons of this approach:
- Medium-sized refactor that would take some time to implement
- Introduces two new tables that we have to maintain
- If a fund ever changed back to being an independent agency, this data model would break down, and we would need to replace it with a more sophisticated one (like option 2 below); this is due to the fact that if two agencies get consolidated at a certain point in time, and then get disaggregated at a future point in time, this crosswalk structure won't tell us which of those two agencies got disaggregated
Option 2: Dedicated crosswalk tables, filled for all years
This option modifies the tables proposed in option 1 above to fill all years prior to a change in agency/fund number. This small tweak makes year into a necessary column for joins, and makes the solution robust to the possibility that the Clerk could change some funds back to independent agencies in the future.
Agency crosswalk
- New table
agency_crosswalk that would replace the agency_info.agency_*_24 fields that we currently maintain
- Columns contained in this table:
year (required integer): Year that the agency_num was active
agency_num (required string): Agency number in this current year
agency_num_final (required string): The final, canonical agency number for this agency that is guaranteed to be stable across years
- When generating
agency_crosswalk each year, we would set this value to the agency_num value for the agency in the most recent year (2024, for the current update)
- Table would have one row for each agency that changed in each year leading up to the change
Here's how agency_crosswalk could support the change to the CPL fund. Rows start with the year 2006 since that's the first year that PTAXISM has data for this fund:
| year |
agency_num |
agency_num_final |
| 2006 |
030210001 |
030210000 |
| 2007 |
030210001 |
030210000 |
| 2008 |
030210001 |
030210000 |
| 2009 |
030210001 |
030210000 |
| 2010 |
030210001 |
030210000 |
| 2011 |
030210001 |
030210000 |
| 2012 |
030210001 |
030210000 |
| 2013 |
030210001 |
030210000 |
| 2014 |
030210001 |
030210000 |
| 2015 |
030210001 |
030210000 |
| 2016 |
030210001 |
030210000 |
| 2017 |
030210001 |
030210000 |
| 2018 |
030210001 |
030210000 |
| 2019 |
030210001 |
030210000 |
| 2020 |
030210001 |
030210000 |
| 2021 |
030210001 |
030210000 |
| 2022 |
030210001 |
030210000 |
| 2023 |
030210001 |
030210000 |
| 2024 |
030210000 |
030210000 |
Here's how a user could use this table:
agency_cw = DBI::dbGetQuery(ptaxsim_db_conn, "SELECT * FROM agency_crosswalk")
agencies %>%
left_join(agency_cw, by = c("year", "agency_num")) %>%
mutate(agency_num = coalesce(agency_num_final, agency_num))
Note that, in contrast to option 1 above, year is now part of the join key. This gives us greater control over the changes that happen year-over-year, which could be useful in particular if the Clerk changes back to reporting certain funds as independent agencies.
Fund crosswalk
- New table
agency_fund_crosswalk
- Columns contained in this table:
year (required integer): Year that the agency_num and fund_num were active
agency_num (required string): Agency number in this year
agency_num_final (required string): Canonical agency number that is stable across years (i.e. number for the most recent year)
fund_num (required string): Fund number in this year
fund_num_final (required string): Canonical fund number that is stable across years (i.e. number for the most recent year)
- Table would have one row for each fund that changed in each year leading up to the change
Here's what agency_fund_crosswalk would like like for the CPL Note Redemption fund:
| year |
agency_num |
agency_num_final |
fund_num |
fund_num_final |
| 2006 |
030210001 |
030210000 |
259000 |
259001 |
| 2007 |
030210001 |
030210000 |
259000 |
259001 |
| 2008 |
030210001 |
030210000 |
259000 |
259001 |
| 2009 |
030210001 |
030210000 |
259000 |
259001 |
| 2010 |
030210001 |
030210000 |
259000 |
259001 |
| 2011 |
030210001 |
030210000 |
259000 |
259001 |
| 2012 |
030210001 |
030210000 |
259000 |
259001 |
| 2013 |
030210001 |
030210000 |
259000 |
259001 |
| 2014 |
030210001 |
030210000 |
259000 |
259001 |
| 2015 |
030210001 |
030210000 |
259000 |
259001 |
| 2016 |
030210001 |
030210000 |
259000 |
259001 |
| 2017 |
030210001 |
030210000 |
259000 |
259001 |
| 2018 |
030210001 |
030210000 |
259000 |
259001 |
| 2019 |
030210001 |
030210000 |
259000 |
259001 |
| 2020 |
030210001 |
030210000 |
259000 |
259001 |
| 2021 |
030210001 |
030210000 |
259000 |
259001 |
| 2022 |
030210001 |
030210000 |
259000 |
259001 |
| 2023 |
030210001 |
030210000 |
259000 |
259001 |
| 2024 |
030210000 |
030210000 |
259001 |
259001 |
Here's how a user could use this table:
agency_fund_cw = DBI::dbGetQuery(ptaxsim_db_conn, "SELECT * FROM agency_fund_crosswalk")
agency_funds %>%
left_join(agency_fund_cw, by = c("year", "agency_num", "fund_num") ) %>%
mutate(
agency_num = coalesce(agency_num_final, agency_num),
fund_num = coalesce(fund_num_final, fund_num)
)
Pros and cons of this approach
- Pros:
- Meets the three requirements motivating this issue
- Future-proof to the universe of possible changes that the Clerk could make to funds and agencies in the future
- Very similar to how
tif_crosswalk works, which is nice from the perspective of standardization
- Cons:
- Hard to immediately tell which data changed when based on a cursory glance at either crosswalk; you can derive the change by looking for the year when
agency_num == agency_num_final or fund_num == fund_num_final, but that's not very intuitive
- Needing to populate rows for all years prior to a change leads to a lot of extraneous rows in the table that don't contain much information
- Clerk probably won't switch back to reporting funds as independent agencies in the future, so this solution may represent premature optimization
Option 3: Keep the current approach
- Pros:
- Requires no extra work, so it's fast to ship
- Implements the fewest possible changes to the data model
- Cons:
- Doesn't meet any of the three requirements motivating this issue
- May be challenging for end users to parse and understand
Summary and recommendation
Here's how I would summarize the costs and benefits of each option:
- Option 1 (Dedicated crosswalk tables): Medium refactor, medium benefit to the user
- Option 2 (Dedicated crosswalk tables that can be joined by
year and filled automatically): Big refactor, big benefit to the user
- Option 3 (Keep the current approach): Minimal refactor, minimal user benefit
Based on these tradeoffs, my suggestion is that we implement option 1, though I'm open to pushback or alternative ideas.
Curious what you think about this idea @kyrasturgill! Also happy to talk more on a call next week.
Background and problem statement
Now that I have finally internalized the 2024 change to agencies that has inspired the agency crosswalk (#65) and its associated vignette (#84) -- thanks in large part to @kyrasturgill's patience and wisdom -- I have some thoughts regarding how we might refactor the crosswalk data model to make it more intuitive to use. We've discussed some of these ideas in person and in Teams chats, but I want to pull the discussion out into an issue so that we can preserve it for posterity.
As a reminder, here is how we are currently implementing the "agency crosswalk" in 2024:
agency_infotableagency_info.agency_num_24(optional string): New agency number in 2024, or null if the agency number has not changedagency_info.agency_name_24(optional string): New agency name in 2024, or null if the agency name has not changedagency_info.agency_change_24(required boolean): True if the agency changed in 2024, and false otherwiseTo me, the issue is that this data model is framed around the notion that the Clerk has changed some agency numbers in 2024 -- hence column names like
agency_num_24(in contrast toagency_num) andagency_change_24, which lead the user to understand the change to be a change of agency number. However, based on recent conversations like #85 (review), it's become apparent to me that it would be more accurate to frame the issue by saying that the Clerk has changed some agencies to funds in 2024: The Clerk reported some funds (like the City of Chicago library funds) as independent agencies prior to 2024, but now they report these funds as funds underneath a parent agency (in this case, the City of Chicago). The old independent agencies still exist in the 2024 agency report, but they all have $0 levies.If this updated framing is indeed more accurate, then I think the current way we structure the "agency crosswalk" is slightly misleading, since it elides the substance of the change, implying that the crux of the change is that some agencies have new numbers, when in reality those agencies have become subsumed as funds under different agencies.
You could argue that this is not a particularly important concern, since we could avoid misunderstandings with more documentation. However, I think the current design also has a couple of concrete disadvantages:
Here is the same fund in 2024:
Note that the both the agency number and fund number have changed. If the user doesn't care about specific funds, this is probably fine, since they will just roll all of the funds up to the agency level anyway. But if the user wants to track specific funds -- say, just CPL funds -- they will have to do some digging to figure out the new fund numbers and hardcode those fund numbers into their analysis.
agency_info.agency_change_25. It is our expectation (and hope) that this kind of change won't happen again, but if it does, our data model won't handle it well.Proposed solutions
Here are a few options for how I think we might improve the agency crosswalk data model to resolve the issues I raised above.
Option 1: Dedicated crosswalk tables
This option splits out the
agency_*_24columns into dedicated tables, adds a more flexibleyearcolumn, and adds support for converting fund numbers in addition to agency numbers. Read on for descriptions of the proposed tables.Agency crosswalk
agency_crosswalkthat would replace theagency_info.agency_*_24fields that we currently maintainyear(required integer): Year thatagency_numchangedagency_num(required string): The old agency number, prior to the changeagency_num_final(required string): The new version of the agency number following the changeyearset to the year that the agency changedHere's an example of the rows that would exist in
agency_crosswalkto support the change to the CPL fund:Here's how a user could use this table:
Note that the user doesn't actually need to use the
yearcolumn, because we are assuming that agency/fund transitions are one-way. If that assumption ever gets violated in the future, we would need to reconsider our approach, because even including theyearcolumn in the join wouldn't be enough to resolve the potential ambiguity. (More on that in the pros/cons section below.)Fund crosswalk
agency_fund_crosswalkyear(required integer): Year that thatagency_numand/orfund_numchangedagency_num(required string): The old agency number, prior to the changeagency_num_final(required string): The new agency number, after the changefund_num(required string): The old fund number, prior to the changefund_num_final(required string): The new fund number, after the changeHere's what
agency_fund_crosswalkwould like like for the CPL Note Redemption fund:Here's how a user could use this table:
Pros and cons of this approach
Option 2: Dedicated crosswalk tables, filled for all years
This option modifies the tables proposed in option 1 above to fill all years prior to a change in agency/fund number. This small tweak makes
yearinto a necessary column for joins, and makes the solution robust to the possibility that the Clerk could change some funds back to independent agencies in the future.Agency crosswalk
agency_crosswalkthat would replace theagency_info.agency_*_24fields that we currently maintainyear(required integer): Year that theagency_numwas activeagency_num(required string): Agency number in this current yearagency_num_final(required string): The final, canonical agency number for this agency that is guaranteed to be stable across yearsagency_crosswalkeach year, we would set this value to theagency_numvalue for the agency in the most recent year (2024, for the current update)Here's how
agency_crosswalkcould support the change to the CPL fund. Rows start with the year 2006 since that's the first year that PTAXISM has data for this fund:Here's how a user could use this table:
Note that, in contrast to option 1 above,
yearis now part of the join key. This gives us greater control over the changes that happen year-over-year, which could be useful in particular if the Clerk changes back to reporting certain funds as independent agencies.Fund crosswalk
agency_fund_crosswalkyear(required integer): Year that theagency_numandfund_numwere activeagency_num(required string): Agency number in this yearagency_num_final(required string): Canonical agency number that is stable across years (i.e. number for the most recent year)fund_num(required string): Fund number in this yearfund_num_final(required string): Canonical fund number that is stable across years (i.e. number for the most recent year)Here's what
agency_fund_crosswalkwould like like for the CPL Note Redemption fund:Here's how a user could use this table:
Pros and cons of this approach
tif_crosswalkworks, which is nice from the perspective of standardizationagency_num == agency_num_finalorfund_num == fund_num_final, but that's not very intuitiveOption 3: Keep the current approach
Summary and recommendation
Here's how I would summarize the costs and benefits of each option:
yearand filled automatically): Big refactor, big benefit to the userBased on these tradeoffs, my suggestion is that we implement option 1, though I'm open to pushback or alternative ideas.
Curious what you think about this idea @kyrasturgill! Also happy to talk more on a call next week.