Skip to content

Refactor agency crosswalk data model to better reflect underlying data change #88

@jeancochrane

Description

@jeancochrane

Background and problem statement

Now that I have finally internalized the 2024 change to agencies that has inspired the agency crosswalk (#65) and its associated vignette (#84) -- thanks in large part to @kyrasturgill's patience and wisdom -- I have some thoughts regarding how we might refactor the crosswalk data model to make it more intuitive to use. We've discussed some of these ideas in person and in Teams chats, but I want to pull the discussion out into an issue so that we can preserve it for posterity.

As a reminder, here is how we are currently implementing the "agency crosswalk" in 2024:

  • New columns on the agency_info table
    • agency_info.agency_num_24 (optional string): New agency number in 2024, or null if the agency number has not changed
    • agency_info.agency_name_24 (optional string): New agency name in 2024, or null if the agency name has not changed
    • agency_info.agency_change_24 (required boolean): True if the agency changed in 2024, and false otherwise

To me, the issue is that this data model is framed around the notion that the Clerk has changed some agency numbers in 2024 -- hence column names like agency_num_24 (in contrast to agency_num) and agency_change_24, which lead the user to understand the change to be a change of agency number. However, based on recent conversations like #85 (review), it's become apparent to me that it would be more accurate to frame the issue by saying that the Clerk has changed some agencies to funds in 2024: The Clerk reported some funds (like the City of Chicago library funds) as independent agencies prior to 2024, but now they report these funds as funds underneath a parent agency (in this case, the City of Chicago). The old independent agencies still exist in the 2024 agency report, but they all have $0 levies.

If this updated framing is indeed more accurate, then I think the current way we structure the "agency crosswalk" is slightly misleading, since it elides the substance of the change, implying that the crux of the change is that some agencies have new numbers, when in reality those agencies have become subsumed as funds under different agencies.

You could argue that this is not a particularly important concern, since we could avoid misunderstandings with more documentation. However, I think the current design also has a couple of concrete disadvantages:

  • Disadvantage 1: The current data model ignores 2024 changes to fund numbers. In the process of moving some funds from an independent agency to a parent agency, the clerk has also changed some fund numbers. For example, here is the Note Redemption fund for CPL in 2023:
year agency_num fund_num agency_name fund_name levy
2023 030210001 259000 CITY OF CHICAGO LIBRARY FUND LIBRARY NOTE REDEMPTION 114582000

Here is the same fund in 2024:

year agency_num fund_num agency_name fund_name levy
2024 030210000 259001 CITY OF CHICAGO NOTE REDEMPTION LIBRARY FUND 117145000

Note that the both the agency number and fund number have changed. If the user doesn't care about specific funds, this is probably fine, since they will just roll all of the funds up to the agency level anyway. But if the user wants to track specific funds -- say, just CPL funds -- they will have to do some digging to figure out the new fund numbers and hardcode those fund numbers into their analysis.

  • Disadvantage 2: The current data model will not be adaptable if these types of changes occur again in the future. The fields are currently tightly coupled to the expectation that this is a one-time change in 2024. If a similar change were to take place next year, we would need a new set of fields like agency_info.agency_change_25. It is our expectation (and hope) that this kind of change won't happen again, but if it does, our data model won't handle it well.

Proposed solutions

Here are a few options for how I think we might improve the agency crosswalk data model to resolve the issues I raised above.

Option 1: Dedicated crosswalk tables

This option splits out the agency_*_24 columns into dedicated tables, adds a more flexible year column, and adds support for converting fund numbers in addition to agency numbers. Read on for descriptions of the proposed tables.

Agency crosswalk

  • New table agency_crosswalk that would replace the agency_info.agency_*_24 fields that we currently maintain
    • Fields contained in this table:
      • year (required integer): Year that agency_num changed
      • agency_num (required string): The old agency number, prior to the change
      • agency_num_final (required string): The new version of the agency number following the change
    • Table would have one row for each agency that has changed at some point in time, with year set to the year that the agency changed

Here's an example of the rows that would exist in agency_crosswalk to support the change to the CPL fund:

year agency_num agency_num_final
2024 030210001 030210000

Here's how a user could use this table:

agency_cw = DBI::dbGetQuery(ptaxsim_db_conn, "SELECT * FROM agency_crosswalk")

agencies %>%
  left_join(agency_cw, by = c("agency_num")) %>%
  mutate(agency_num = coalesce(agency_num_final, agency_num))

Note that the user doesn't actually need to use the year column, because we are assuming that agency/fund transitions are one-way. If that assumption ever gets violated in the future, we would need to reconsider our approach, because even including the year column in the join wouldn't be enough to resolve the potential ambiguity. (More on that in the pros/cons section below.)

Fund crosswalk

  • New table agency_fund_crosswalk
    • year (required integer): Year that that agency_num and/or fund_num changed
    • agency_num (required string): The old agency number, prior to the change
    • agency_num_final (required string): The new agency number, after the change
    • fund_num (required string): The old fund number, prior to the change
    • fund_num_final (required string): The new fund number, after the change

Here's what agency_fund_crosswalk would like like for the CPL Note Redemption fund:

year agency_num agency_num_final fund_num fund_num_final
2024 030210001 030210000 259000 259001

Here's how a user could use this table:

agency_fund_cw = DBI::dbGetQuery(ptaxsim_db_conn, "SELECT * FROM agency_fund_crosswalk")

agency_funds %>%
  left_join(agency_fund_cw, by = c("agency_num", "fund_num") ) %>%
  mutate(
    agency_num = coalesce(agency_num_final, agency_num),
    fund_num = coalesce(fund_num_final, fund_num)
)

Pros and cons of this approach

  • Pros of this approach:
    • Meets three requirements motivating this issue:
      1. More closely matches the underlying data change
      2. Supports tracking changing funds over time using fund numbers
      3. Easy for our team to update over time, since it's just keyed by year and agency number
    • Very similar to our current design, but adds easy support for new years in the future
    • Easier for end users to understand and use than our current approach
  • Cons of this approach:
    • Medium-sized refactor that would take some time to implement
    • Introduces two new tables that we have to maintain
    • If a fund ever changed back to being an independent agency, this data model would break down, and we would need to replace it with a more sophisticated one (like option 2 below); this is due to the fact that if two agencies get consolidated at a certain point in time, and then get disaggregated at a future point in time, this crosswalk structure won't tell us which of those two agencies got disaggregated

Option 2: Dedicated crosswalk tables, filled for all years

This option modifies the tables proposed in option 1 above to fill all years prior to a change in agency/fund number. This small tweak makes year into a necessary column for joins, and makes the solution robust to the possibility that the Clerk could change some funds back to independent agencies in the future.

Agency crosswalk

  • New table agency_crosswalk that would replace the agency_info.agency_*_24 fields that we currently maintain
    • Columns contained in this table:
      • year (required integer): Year that the agency_num was active
      • agency_num (required string): Agency number in this current year
      • agency_num_final (required string): The final, canonical agency number for this agency that is guaranteed to be stable across years
        • When generating agency_crosswalk each year, we would set this value to the agency_num value for the agency in the most recent year (2024, for the current update)
    • Table would have one row for each agency that changed in each year leading up to the change

Here's how agency_crosswalk could support the change to the CPL fund. Rows start with the year 2006 since that's the first year that PTAXISM has data for this fund:

year agency_num agency_num_final
2006 030210001 030210000
2007 030210001 030210000
2008 030210001 030210000
2009 030210001 030210000
2010 030210001 030210000
2011 030210001 030210000
2012 030210001 030210000
2013 030210001 030210000
2014 030210001 030210000
2015 030210001 030210000
2016 030210001 030210000
2017 030210001 030210000
2018 030210001 030210000
2019 030210001 030210000
2020 030210001 030210000
2021 030210001 030210000
2022 030210001 030210000
2023 030210001 030210000
2024 030210000 030210000

Here's how a user could use this table:

agency_cw = DBI::dbGetQuery(ptaxsim_db_conn, "SELECT * FROM agency_crosswalk")

agencies %>%
  left_join(agency_cw, by = c("year", "agency_num")) %>%
  mutate(agency_num = coalesce(agency_num_final, agency_num))

Note that, in contrast to option 1 above, year is now part of the join key. This gives us greater control over the changes that happen year-over-year, which could be useful in particular if the Clerk changes back to reporting certain funds as independent agencies.

Fund crosswalk

  • New table agency_fund_crosswalk
    • Columns contained in this table:
      • year (required integer): Year that the agency_num and fund_num were active
      • agency_num (required string): Agency number in this year
      • agency_num_final (required string): Canonical agency number that is stable across years (i.e. number for the most recent year)
      • fund_num (required string): Fund number in this year
      • fund_num_final (required string): Canonical fund number that is stable across years (i.e. number for the most recent year)
    • Table would have one row for each fund that changed in each year leading up to the change

Here's what agency_fund_crosswalk would like like for the CPL Note Redemption fund:

year agency_num agency_num_final fund_num fund_num_final
2006 030210001 030210000 259000 259001
2007 030210001 030210000 259000 259001
2008 030210001 030210000 259000 259001
2009 030210001 030210000 259000 259001
2010 030210001 030210000 259000 259001
2011 030210001 030210000 259000 259001
2012 030210001 030210000 259000 259001
2013 030210001 030210000 259000 259001
2014 030210001 030210000 259000 259001
2015 030210001 030210000 259000 259001
2016 030210001 030210000 259000 259001
2017 030210001 030210000 259000 259001
2018 030210001 030210000 259000 259001
2019 030210001 030210000 259000 259001
2020 030210001 030210000 259000 259001
2021 030210001 030210000 259000 259001
2022 030210001 030210000 259000 259001
2023 030210001 030210000 259000 259001
2024 030210000 030210000 259001 259001

Here's how a user could use this table:

agency_fund_cw = DBI::dbGetQuery(ptaxsim_db_conn, "SELECT * FROM agency_fund_crosswalk")

agency_funds %>%
  left_join(agency_fund_cw, by = c("year", "agency_num", "fund_num") ) %>%
  mutate(
    agency_num = coalesce(agency_num_final, agency_num),
    fund_num = coalesce(fund_num_final, fund_num)
)

Pros and cons of this approach

  • Pros:
    • Meets the three requirements motivating this issue
    • Future-proof to the universe of possible changes that the Clerk could make to funds and agencies in the future
    • Very similar to how tif_crosswalk works, which is nice from the perspective of standardization
  • Cons:
    • Hard to immediately tell which data changed when based on a cursory glance at either crosswalk; you can derive the change by looking for the year when agency_num == agency_num_final or fund_num == fund_num_final, but that's not very intuitive
    • Needing to populate rows for all years prior to a change leads to a lot of extraneous rows in the table that don't contain much information
    • Clerk probably won't switch back to reporting funds as independent agencies in the future, so this solution may represent premature optimization

Option 3: Keep the current approach

  • Pros:
    • Requires no extra work, so it's fast to ship
    • Implements the fewest possible changes to the data model
  • Cons:
    • Doesn't meet any of the three requirements motivating this issue
    • May be challenging for end users to parse and understand

Summary and recommendation

Here's how I would summarize the costs and benefits of each option:

  • Option 1 (Dedicated crosswalk tables): Medium refactor, medium benefit to the user
  • Option 2 (Dedicated crosswalk tables that can be joined by year and filled automatically): Big refactor, big benefit to the user
  • Option 3 (Keep the current approach): Minimal refactor, minimal user benefit

Based on these tradeoffs, my suggestion is that we implement option 1, though I'm open to pushback or alternative ideas.


Curious what you think about this idea @kyrasturgill! Also happy to talk more on a call next week.

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions