An R package for inter-temporal and inter-geography crosswalks

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
UI-Research/crosswalk

crosswalk

An R package for translating data across space and time.

Overview

This package provides a consistent API and standardized versions of crosswalks, enabling consistent approaches across different geography and year combinations. It also facilitates interpolation (that is, adjusting source geography/year values by their crosswalk weights and translating them to the desired target geography/year) and provides diagnostics of the joins between source data and crosswalks.

The package sources crosswalks from:

  • Geocorr (Missouri Census Data Center) - for inter-geography crosswalks (same-decade)
  • IPUMS NHGIS - for inter-temporal crosswalks (across decades)
  • CT Data Collaborative - for Connecticut 2020→2022 crosswalks (planning region changes)

Why Use crosswalk?

  • Programmatic access: No more manual downloads from web interfaces; data is cached for speed
  • Standardized output: Consistent column names across all crosswalk sources
  • Metadata tracking: Full provenance of crosswalks stored as attributes
  • Crosswalk chaining: Automatic chaining when multiple crosswalks are required

Installation

# Install from GitHub
renv::install("UI-Research/crosswalk")

Quick Start

First we obtain a crosswalk and apply it to our data:

library(crosswalk)
library(dplyr)
library(ggplot2)
library(stringr)
library(sf)
library(tidycensus)
library(tigris)
library(scales)

source_data = get_acs(
    year = 2023,
    geography = "zcta",
    output = "wide",
    variables = c(below_poverty_level = "B17001_002")) %>%
  select(
    source_geoid = GEOID,
    count_below_poverty_level = below_poverty_levelE)

# Get a crosswalk from ZCTAs to PUMAs (same year, uses Geocorr (2022))
zcta_puma_crosswalk <- get_crosswalk(
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")

# Apply the crosswalk to your data
crosswalked_data <- crosswalk_data(
  data = source_data,
  crosswalk = zcta_puma_crosswalk)

## Or in a single step
crosswalked_data = crosswalk_data(
  data = source_data,
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")

What do the crosswalk(s) reflect and how were they sourced?

## and there's more (not shown)
names(attr(crosswalked_data, "crosswalk_metadata")) %>% head()
#> [1] "call_parameters"       "data_source"           "data_source_full_name"
#> [4] "download_url"          "api_endpoint"          "documentation_url"

How well did the crosswalk join to our source data?

## look at all the characteristics of the join(s) between the source data
## and the crosswalks
join_quality = attr(crosswalked_data, "join_quality")

## what share of records in the source data do not join to a crosswalk and
## thus are dropped during the crosswalking process?
join_quality$pct_data_unmatched
#> [1] 0.4234277

## zctas aren't nested within states, otherwise join_quality$state_analysis_data 
## would help us to ID whether non-joining source data were clustered within one
## or a few states. instead we can join to spatial data to diagnose further:
zctas_sf = zctas(year = 2023, progress_bar = FALSE)
states_sf = states(year = 2023, cb = TRUE, progress_bar = FALSE)

## apart from DC, which has a disproportionate number of non-joining ZCTAs--
## seemingly corresponding to federal areas and buildings--the distribution of
## non-joining ZCTAs appears proportionate to state-level populations and is 
## distributed across many states:
zctas_sf %>%
  filter(GEOID20 %in% join_quality$data_geoids_unmatched) %>%
  st_intersection(states_sf %>% select(NAME)) %>%
  st_drop_geometry() %>%
  count(NAME, sort = TRUE) %>%
  head()
#>                   NAME  n
#> 1 District of Columbia 19
#> 2             New York 15
#> 3                Texas  9
#> 4           California  8
#> 5             Colorado  6
#> 6                 Utah  6
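
To make the unmatched diagnostic concrete, here is a minimal base R sketch of how a share of non-joining source records could be computed by hand. The data are made up, and the names `toy_crosswalk`, `is_unmatched`, `geoids_unmatched`, and `share_unmatched` are hypothetical; this illustrates the general idea, not the package's internal implementation, and the package may report the value on a different scale (for example, as a percentage).

```r
# Toy source data and crosswalk; all values are invented for illustration.
source_data <- data.frame(
  source_geoid = c("00001", "00002", "00003", "00004"),
  count_below_poverty_level = c(120, 80, 45, 10))

toy_crosswalk <- data.frame(
  source_geoid = c("00001", "00002", "00002", "00003"),
  target_geoid = c("A", "A", "B", "B"),
  allocation_factor_source_to_target = c(1, 0.6, 0.4, 1))

# A source record is unmatched when its GEOID never appears in the crosswalk.
is_unmatched <- !(source_data$source_geoid %in% toy_crosswalk$source_geoid)

# Which GEOIDs would be dropped, and what share of source records they make up.
geoids_unmatched <- source_data$source_geoid[is_unmatched]
share_unmatched <- mean(is_unmatched)

geoids_unmatched
share_unmatched
```

Listing the unmatched GEOIDs alongside the share makes it easier to check whether the dropped records are concentrated in particular places.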

And how accurate was the crosswalking process?

comparison_data = get_acs(
    year = 2023,
    geography = "puma",
    output = "wide",
    variables = c(
      below_poverty_level = "B17001_002")) %>%
  select(
    source_geoid = GEOID,
    count_below_poverty_level_acs = below_poverty_levelE)

combined_data = left_join(
  comparison_data,
  crosswalked_data,
  by = c("source_geoid" = "geoid"))

combined_data %>%
  select(source_geoid, matches("count")) %>%
  mutate(difference_percent = (count_below_poverty_level_acs - count_below_poverty_level) / count_below_poverty_level_acs) %>%
  ggplot() +
    geom_histogram(aes(x = difference_percent)) +
    theme_minimal() +
    theme(panel.grid = element_blank()) +
    scale_x_continuous(labels = percent) +
    labs(
      title = "Crosswalked data approximates observed values",
      subtitle = "Block group-level source data would produce more accurate crosswalked values",
      y = "",
      x = "Percent difference between observed and crosswalked values")
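
Alongside the histogram, a couple of summary numbers can help describe accuracy. The sketch below uses invented values in a data frame shaped like `combined_data`; the name `combined_toy` and the 8 percent threshold are arbitrary choices for illustration, not anything prescribed by the package.

```r
# Toy stand-in for combined_data; the counts are invented for illustration.
combined_toy <- data.frame(
  source_geoid = c("P1", "P2", "P3", "P4"),
  count_below_poverty_level_acs = c(1000, 2000, 1500, 800),
  count_below_poverty_level = c(900, 2100, 1500, 780))

# Same difference measure as in the plot: (observed - crosswalked) / observed.
difference_percent <- with(
  combined_toy,
  (count_below_poverty_level_acs - count_below_poverty_level) /
    count_below_poverty_level_acs)

# Median absolute difference, and the share of records within the threshold.
median_abs_difference <- median(abs(difference_percent))
share_within_threshold <- mean(abs(difference_percent) < 0.08)

median_abs_difference
share_within_threshold
```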

Core Functions

The package has two main functions, though you can also specify the needed crosswalk(s) directly from crosswalk_data() and omit the intermediate get_crosswalk() call.

| Function | Purpose |
| --- | --- |
| get_crosswalk() | Fetch crosswalk(s) |
| crosswalk_data() | Apply crosswalk(s) to interpolate data to the target geography-year |

Output Structure

get_crosswalk() always returns a list with three elements:

| Element | Description |
| --- | --- |
| crosswalks | A named list of crosswalks (step_1, step_2, etc.) |
| plan | Details about what crosswalks are being fetched |
| message | A description of the crosswalk chain |

Multi-Step Crosswalks

For some source year/geography -> target year/geography combinations, there is not a single direct crosswalk. In such cases, we need two crosswalks. The package automatically plans and fetches the required crosswalks:

  1. Step 1 (NHGIS): Change year, keep geography constant
  2. Step 2 (Geocorr): Change geography at target year

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population",
  silent = TRUE)

# Two crosswalks are returned
# Step 1: 2010 tracts -> 2020 tracts (NHGIS)
# Step 2: 2020 tracts -> 2020 ZCTAs (Geocorr)
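
To illustrate what chaining two crosswalks means arithmetically, the following base R sketch composes two toy crosswalks into one by joining on the intermediate geography and multiplying allocation factors. The GEOIDs, the factors, and the helper column names (`mid_geoid`, `af_1`, `af_2`) are all hypothetical. The source does not say whether the package composes crosswalks like this or applies them one after the other; for count variables the two approaches give the same totals.

```r
# Step 1 (toy): 2010 tracts -> 2020 tracts.
step_1 <- data.frame(
  source_geoid = c("T2010_1", "T2010_1", "T2010_2"),
  mid_geoid    = c("T2020_A", "T2020_B", "T2020_B"),
  af_1         = c(0.7, 0.3, 1))

# Step 2 (toy): 2020 tracts -> 2020 ZCTAs.
step_2 <- data.frame(
  mid_geoid    = c("T2020_A", "T2020_B", "T2020_B"),
  target_geoid = c("Z1", "Z1", "Z2"),
  af_2         = c(1, 0.5, 0.5))

# Join on the intermediate geography and multiply the two factors.
chained <- merge(step_1, step_2, by = "mid_geoid")
chained$allocation_factor_source_to_target <- chained$af_1 * chained$af_2

# Collapse paths that share the same source and target.
combined <- aggregate(
  allocation_factor_source_to_target ~ source_geoid + target_geoid,
  data = chained,
  FUN = sum)

combined
```

In this toy example the combined factors for each source still sum to 1, because both input crosswalks fully allocate each of their sources.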

Crosswalk Structure

Each crosswalk contains standardized columns:

| Column | Description |
| --- | --- |
| source_geoid | Identifier for source geography |
| target_geoid | Identifier for target geography |
| allocation_factor_source_to_target | Weight for interpolating values |
| weighting_factor | Attribute used for weighting (population, housing, land) |

Additional columns may include source_year, target_year, population_2020, housing_2020, and land_area_sqmi depending on the source of the crosswalk.
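
As a sanity check on a crosswalk with these columns, one might confirm that the allocation factors for each source geography sum to roughly 1. That property is an assumption about fully covered sources in allocation-style crosswalks generally, not a documented guarantee of this package. The sketch below uses a hand-built toy crosswalk; `factor_sums` and `incomplete_sources` are hypothetical names.

```r
# Toy crosswalk using the standardized column names; values are invented.
toy_crosswalk <- data.frame(
  source_geoid = c("00001", "00002", "00002", "00003"),
  target_geoid = c("A", "A", "B", "B"),
  allocation_factor_source_to_target = c(1, 0.6, 0.4, 1),
  weighting_factor = "population")

# Sum allocation factors within each source geography.
factor_sums <- tapply(
  toy_crosswalk$allocation_factor_source_to_target,
  toy_crosswalk$source_geoid,
  sum)

# Flag sources whose factors do not sum to 1 (within a small tolerance).
incomplete_sources <- names(factor_sums)[abs(factor_sums - 1) > 1e-8]

factor_sums
incomplete_sources
```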

Accessing Metadata

Each crosswalk tibble has a crosswalk_metadata attribute that documents what the crosswalk represents and how it was created:

metadata <- attr(result$crosswalks$step_1, "crosswalk_metadata")
names(metadata)

Interpolation

crosswalk_data() applies crosswalk weights to transform your data. If you’re in a hurry, you can omit a call to get_crosswalk() and specify the needed crosswalk parameters to crosswalk_data(), which will pass these to get_crosswalk() behind the scenes. Or you can call get_crosswalk() explicitly and then pass the result to crosswalk_data().

Column Naming Convention

The function auto-detects columns based on prefixes:

| Prefix | Treatment |
| --- | --- |
| count_ | Summed after weighting (for counts like population, housing units) |
| mean_, median_, percent_, ratio_ | Weighted mean (for rates, percentages, averages) |

You can also specify columns explicitly via count_columns and non_count_columns. All non-count variables are interpolated using weighted means, weighting by the allocation factor from the crosswalk.
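
The two treatments can be worked through by hand on toy data. This base R sketch weights and sums a `count_` column, and takes an allocation-factor-weighted mean of a `percent_` column, as described above. The data and the variable names (`count_population`, `percent_poverty`, `toy_crosswalk`) are invented, and the code is an illustration of the arithmetic rather than the package's own implementation.

```r
# Toy source data with one count column and one non-count column.
source_data <- data.frame(
  source_geoid = c("00001", "00002", "00003"),
  count_population = c(100, 200, 50),
  percent_poverty = c(0.10, 0.20, 0.40))

# Toy crosswalk: source 00002 is split 60/40 between targets A and B.
toy_crosswalk <- data.frame(
  source_geoid = c("00001", "00002", "00002", "00003"),
  target_geoid = c("A", "A", "B", "B"),
  allocation_factor_source_to_target = c(1, 0.6, 0.4, 1))

joined <- merge(source_data, toy_crosswalk, by = "source_geoid")
af <- joined$allocation_factor_source_to_target

# count_ columns: multiply by the allocation factor, then sum within target.
count_by_target <- tapply(joined$count_population * af, joined$target_geoid, sum)

# Non-count columns: weighted mean within target, weighted by the factor.
percent_by_target <-
  tapply(joined$percent_poverty * af, joined$target_geoid, sum) /
  tapply(af, joined$target_geoid, sum)

count_by_target
percent_by_target
```

Here target A receives all of 00001 plus 60 percent of 00002, so its count is 100 + 120 = 220, while its percentage is (0.10 × 1 + 0.20 × 0.6) / 1.6 = 0.1375.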

Supported Geography and Year Combinations

get_available_crosswalks() returns a listing of all supported year-geography combinations.

get_available_crosswalks() %>%
  head()
#> # A tibble: 6 × 4
#>   source_geography target_geography source_year target_year
#>   <chr>            <chr>                  <int>       <int>
#> 1 block            block                   1990        2010
#> 2 block            block                   2000        2010
#> 3 block            block                   2010        2020
#> 4 block            block                   2020        2010
#> 5 block            block                   2020        2022
#> 6 block            block                   2022        2020
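
A listing in this shape can be filtered to check whether a particular combination is supported before requesting a crosswalk. The sketch below rebuilds only the six rows shown above as a plain data frame and defines a hypothetical helper, `is_supported()`, which is not part of the package. In practice the same check would be run against the full result of get_available_crosswalks().

```r
# The six rows printed above, rebuilt by hand; the real listing is longer.
available <- data.frame(
  source_geography = rep("block", 6),
  target_geography = rep("block", 6),
  source_year = c(1990L, 2000L, 2010L, 2020L, 2020L, 2022L),
  target_year = c(2010L, 2010L, 2020L, 2010L, 2022L, 2020L))

# Hypothetical helper: TRUE when the listing contains the requested combination.
is_supported <- function(listing, source_geography, target_geography,
                         source_year, target_year) {
  any(
    listing$source_geography == source_geography &
      listing$target_geography == target_geography &
      listing$source_year == source_year &
      listing$target_year == target_year)
}

# Present in the six rows above.
is_supported(available, "block", "block", 2010, 2020)

# Absent from these six rows (the full listing may differ).
is_supported(available, "block", "block", 1990, 2020)
```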

API Keys

NHGIS crosswalks require an IPUMS API key. Get one at https://account.ipums.org/api_keys and add it to your .Renviron:

usethis::edit_r_environ()
# Add: IPUMS_API_KEY=your_key_here
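
After editing .Renviron, R needs to be restarted (or the file re-read) for the change to take effect. A quick base R check that the variable is visible to the current session might look like the sketch below; `ipums_key` and `has_ipums_key` are hypothetical names, and the check only confirms the variable is non-empty, not that the key is valid.

```r
# Read the environment variable; Sys.getenv() returns "" when it is unset.
ipums_key <- Sys.getenv("IPUMS_API_KEY")
has_ipums_key <- nzchar(ipums_key)

if (has_ipums_key) {
  message("IPUMS_API_KEY is set for this session.")
} else {
  message("IPUMS_API_KEY is not set; restart R after editing .Renviron.")
}
```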

Caching

Use the cache parameter to save crosswalks locally, which speeds up repeated requests:

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population",
  cache = here::here("crosswalks-cache"))

Citations

Cite the organizations that produce the crosswalks returned by this package:

For NHGIS, see requirements at: https://www.nhgis.org/citation-and-use-nhgis-data

For Geocorr, a suggested citation (update the year):

Missouri Census Data Center, University of Missouri. (2022/2018). Geocorr 2022/2018: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022/2018.html

For CTData, a suggested citation (adjust for alternate source geography):

CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. Retrieved from: https://github.com/CT-Data-Collaborative/2022-tract-crosswalk.

For this package, refer here: https://ui-research.github.io/crosswalk/authors.html#citation
