crosswalk

An R package for translating data across space and time.

Overview

This package provides a consistent API and standardized versions of crosswalks, so that the same approach works across different geography and year combinations. The package also facilitates interpolation (that is, adjusting source geography/year values by their crosswalk weights and translating them to the desired target geography/year) and provides diagnostics of the joins between source data and crosswalks.

The package sources crosswalks from:

Geocorr (Missouri Census Data Center) - for inter-geography crosswalks (same-decade)
IPUMS NHGIS - for inter-temporal crosswalks (across decades)
CT Data Collaborative - for Connecticut 2020→2022 crosswalks (planning region changes)

Why Use `crosswalk`?

Programmatic access: No more manual downloads from web interfaces; data is cached for speed
Standardized output: Consistent column names across all crosswalk sources
Metadata tracking: Full provenance of crosswalks stored as attributes
Crosswalk chaining: Automatic chaining when multiple crosswalks are required

Installation

# Install from GitHub
renv::install("UI-Research/crosswalk")

Quick Start

First we obtain a crosswalk and apply it to our data:

library(crosswalk)
library(dplyr)
library(ggplot2)
library(stringr)
library(sf)
library(tidycensus)
library(tigris)
library(scales)

source_data = get_acs(
    year = 2023,
    geography = "zcta",
    output = "wide",
    variables = c(below_poverty_level = "B17001_002")) %>%
  select(
    source_geoid = GEOID,
    count_below_poverty_level = below_poverty_levelE)

# Get a crosswalk from ZCTAs to PUMAs (same year; uses Geocorr 2022)
zcta_puma_crosswalk <- get_crosswalk(
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")

# Apply the crosswalk to your data
crosswalked_data <- crosswalk_data(
  data = source_data,
  crosswalk = zcta_puma_crosswalk)

## Or in a single step
crosswalked_data = crosswalk_data(
  data = source_data,
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")

What do the crosswalk(s) reflect and how were they sourced?

## the first six metadata fields (there are more, not shown)
names(attr(crosswalked_data, "crosswalk_metadata")) %>% head()
#> [1] "call_parameters"       "data_source"           "data_source_full_name"
#> [4] "download_url"          "api_endpoint"          "documentation_url"

How well did the crosswalk join to our source data?

## look at all the characteristics of the join(s) between the source data
## and the crosswalks
join_quality = attr(crosswalked_data, "join_quality")

## what share of records in the source data do not join to a crosswalk and
## thus are dropped during the crosswalking process?
join_quality$pct_data_unmatched
#> [1] 0.4234277

## zctas aren't nested within states, otherwise join_quality$state_analysis_data 
## would help us to ID whether non-joining source data were clustered within one
## or a few states. instead we can join to spatial data to diagnose further:
zctas_sf = zctas(year = 2023, progress_bar = FALSE)
states_sf = states(year = 2023, cb = TRUE, progress_bar = FALSE)

## apart from DC, which has a disproportionate number of non-joining ZCTAs--
## seemingly corresponding to federal areas and buildings--the distribution of
## non-joining ZCTAs appears proportionate to state-level populations and is 
## distributed across many states:
zctas_sf %>%
  filter(GEOID20 %in% join_quality$data_geoids_unmatched) %>%
  st_intersection(states_sf %>% select(NAME)) %>%
  st_drop_geometry() %>%
  count(NAME, sort = TRUE) %>%
  head()
#>                   NAME  n
#> 1 District of Columbia 19
#> 2             New York 15
#> 3                Texas  9
#> 4           California  8
#> 5             Colorado  6
#> 6                 Utah  6
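
The join diagnostics above can be approximated by hand. The following is a minimal sketch with toy data (the objects toy_data and toy_crosswalk and all of their values are made up for illustration) of how the share of non-joining records and their identifiers could be computed with an anti-join. It is not the package's implementation, and it computes a proportion between 0 and 1, which may not match the units of pct_data_unmatched.

```r
library(dplyr)

# Toy source data with four geographies (hypothetical values)
toy_data <- tibble(
  source_geoid = c("A", "B", "C", "D"),
  count_households = c(10, 20, 30, 40))

# Toy crosswalk that only covers geographies A and B
toy_crosswalk <- tibble(
  source_geoid = c("A", "B", "B"),
  target_geoid = c("X", "X", "Y"),
  allocation_factor_source_to_target = c(1, 0.3, 0.7))

# Records in the source data with no matching row in the crosswalk
unmatched <- anti_join(toy_data, toy_crosswalk, by = "source_geoid")

# Proportion of source records that would be dropped, and which ones
share_unmatched <- nrow(unmatched) / nrow(toy_data)
unmatched_geoids <- unmatched$source_geoid
```

Here C and D have no crosswalk rows, so half of the toy records would be dropped.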

And how accurate was the crosswalking process?

comparison_data = get_acs(
    year = 2023,
    geography = "puma",
    output = "wide",
    variables = c(
      below_poverty_level = "B17001_002")) %>%
  select(
    source_geoid = GEOID,
    count_below_poverty_level_acs = below_poverty_levelE)

combined_data = left_join(
  comparison_data,
  crosswalked_data,
  by = c("source_geoid" = "geoid"))

combined_data %>%
  select(source_geoid, matches("count")) %>%
  mutate(
    difference_percent =
      (count_below_poverty_level_acs - count_below_poverty_level) /
      count_below_poverty_level_acs) %>%
  ggplot() +
    geom_histogram(aes(x = difference_percent)) +
    theme_minimal() +
    theme(panel.grid = element_blank()) +
    scale_x_continuous(labels = percent) +
    labs(
      title = "Crosswalked data approximates observed values",
      subtitle = "Block group-level source data would produce more accurate crosswalked values",
      y = "",
      x = "Percent difference between observed and crosswalked values")

Core Functions

The package has two main functions, though you can also specify the needed crosswalk(s) directly from crosswalk_data() and omit the intermediate get_crosswalk() call.

Function	Purpose
`get_crosswalk()`	Fetch crosswalk(s)
`crosswalk_data()`	Apply crosswalk(s) to interpolate data to the target geography-year

Output Structure

get_crosswalk() always returns a list with three elements:

Element	Description
`crosswalks`	A named list of crosswalks (`step_1`, `step_2`, etc.)
`plan`	Details about what crosswalks are being fetched
`message`	A description of the crosswalk chain
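
To make the structure concrete, here is a small sketch that builds a mock object with the same three element names and walks over its steps. The mock values, including the contents of plan and message, are placeholders and not real package output. The only assumption about the real return value is the element naming documented above.

```r
# Mock object mimicking the documented list structure (placeholder values only)
mock_result <- list(
  crosswalks = list(
    step_1 = data.frame(
      source_geoid = c("a", "b"),
      target_geoid = c("x", "x"),
      allocation_factor_source_to_target = c(1, 1)),
    step_2 = data.frame(
      source_geoid = "x",
      target_geoid = "z",
      allocation_factor_source_to_target = 1)),
  plan = "placeholder: the real plan structure is not shown here",
  message = "placeholder: description of a two-step chain")

# Which steps are present, and how many rows does each crosswalk have?
step_names <- names(mock_result$crosswalks)
rows_per_step <- vapply(mock_result$crosswalks, nrow, integer(1))
```

A single-step request would, under the same naming, contain only step_1.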

Multi-Step Crosswalks

For some source year/geography → target year/geography combinations, no single direct crosswalk exists and two crosswalks are needed. The package automatically plans and fetches the required crosswalks:

Step 1 (NHGIS): Change year, keep geography constant
Step 2 (Geocorr): Change geography at target year

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population",
  silent = TRUE)

# Two crosswalks are returned
# Step 1: 2010 tracts -> 2020 tracts (NHGIS)
# Step 2: 2020 tracts -> 2020 ZCTAs (Geocorr)
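
One way to reason about a two-step chain is to compose the allocation factors: join the two crosswalks on the intermediate geography and multiply the weights along each path. The sketch below does this with made-up identifiers and weights. It illustrates the arithmetic only; the package may apply the steps sequentially rather than composing them, and the helper column names (intermediate_geoid, factor_step_1, factor_step_2) are hypothetical.

```r
library(dplyr)

# Hypothetical step 1: 2010 tracts -> 2020 tracts
step_1 <- tibble(
  source_geoid = c("T10_1", "T10_1", "T10_2"),
  target_geoid = c("T20_1", "T20_2", "T20_2"),
  allocation_factor_source_to_target = c(0.5, 0.5, 1.0))

# Hypothetical step 2: 2020 tracts -> 2020 ZCTAs
step_2 <- tibble(
  source_geoid = c("T20_1", "T20_2", "T20_2"),
  target_geoid = c("Z1", "Z1", "Z2"),
  allocation_factor_source_to_target = c(1.0, 0.25, 0.75))

composed <- step_1 %>%
  rename(
    intermediate_geoid = target_geoid,
    factor_step_1 = allocation_factor_source_to_target) %>%
  inner_join(
    step_2 %>%
      rename(
        intermediate_geoid = source_geoid,
        factor_step_2 = allocation_factor_source_to_target),
    by = "intermediate_geoid",
    # intermediate tracts repeat on both sides (assumes dplyr >= 1.1.0)
    relationship = "many-to-many") %>%
  group_by(source_geoid, target_geoid) %>%
  # multiply weights along each path, then sum paths sharing the same endpoints
  summarize(
    allocation_factor_source_to_target = sum(factor_step_1 * factor_step_2),
    .groups = "drop")
```

In this toy example the composed factors for each 2010 tract still sum to 1, because both input crosswalks fully allocate every source geography.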

Crosswalk Structure

Each crosswalk contains standardized columns:

Column	Description
`source_geoid`	Identifier for source geography
`target_geoid`	Identifier for target geography
`allocation_factor_source_to_target`	Weight for interpolating values
`weighting_factor`	What attribute was used (population, housing, land)

Additional columns may include source_year, target_year, population_2020, housing_2020, and land_area_sqmi depending on the source of the crosswalk.

Accessing Metadata

Each crosswalk tibble has a crosswalk_metadata attribute that documents what the crosswalk represents and how it was created:

metadata <- attr(result$crosswalks$step_1, "crosswalk_metadata")
names(metadata)

Interpolation

crosswalk_data() applies crosswalk weights to transform your data. If you’re in a hurry, you can omit a call to get_crosswalk() and specify the needed crosswalk parameters to crosswalk_data(), which will pass these to get_crosswalk() behind the scenes. Or you can call get_crosswalk() explicitly and then pass the result to crosswalk_data().

Column Naming Convention

The function auto-detects columns based on prefixes:

Prefix	Treatment
`count_`	Summed after weighting (for counts like population, housing units)
`mean_`, `median_`, `percent_`, `ratio_`	Weighted mean (for rates, percentages, averages)

You can also specify columns explicitly via count_columns and non_count_columns. All non-count variables are interpolated using weighted means, weighting by the allocation factor from the crosswalk.
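
The two treatments can be illustrated by hand on toy data. The sketch below is not the package's implementation; the data values are made up, and it simply applies the rules described above: count columns are multiplied by the allocation factor and summed, while non-count columns are averaged using the allocation factor as the weight.

```r
library(dplyr)

# Toy source data: one count column and one non-count column (hypothetical values)
toy_source <- tibble(
  source_geoid = c("A", "B"),
  count_population = c(100, 200),
  percent_poverty = c(0.10, 0.20))

# Toy crosswalk using the standardized column names
toy_crosswalk <- tibble(
  source_geoid = c("A", "A", "B"),
  target_geoid = c("X", "Y", "Y"),
  allocation_factor_source_to_target = c(0.6, 0.4, 1.0))

interpolated <- toy_source %>%
  inner_join(toy_crosswalk, by = "source_geoid") %>%
  group_by(target_geoid) %>%
  summarize(
    # non-count: weighted mean, weighted by the allocation factor
    percent_poverty = weighted.mean(
      percent_poverty,
      allocation_factor_source_to_target),
    # count: weight each source value, then sum within the target
    count_population = sum(
      count_population * allocation_factor_source_to_target))
```

Target X receives 60% of A's population (60) and A's rate unchanged; target Y receives the remaining 40 from A plus all 200 from B, and a rate that leans toward B's because B's allocation factor is larger.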

Supported Geography and Year Combinations

get_available_crosswalks() returns a listing of all supported year-geography combinations.

get_available_crosswalks() %>%
  head()
#> # A tibble: 6 × 4
#>   source_geography target_geography source_year target_year
#>   <chr>            <chr>                  <int>       <int>
#> 1 block            block                   1990        2010
#> 2 block            block                   2000        2010
#> 3 block            block                   2010        2020
#> 4 block            block                   2020        2010
#> 5 block            block                   2020        2022
#> 6 block            block                   2022        2020

API Keys

NHGIS crosswalks require an IPUMS API key. Get one at https://account.ipums.org/api_keys and add it to your .Renviron:

usethis::edit_r_environ()
# Add: IPUMS_API_KEY=your_key_here
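
After editing .Renviron, restart R so the file is re-read. A quick way to confirm that the key is visible to the session is sketched below; has_ipums_key() is a hypothetical helper written for this example and is not part of the package.

```r
# Hypothetical helper: TRUE if IPUMS_API_KEY is set to a non-empty value
has_ipums_key <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (!has_ipums_key()) {
  message("IPUMS_API_KEY is not set; NHGIS crosswalks will not be available.")
}
```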

Caching

Use the cache parameter to save crosswalks locally for faster reuse:

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population",
  cache = here::here("crosswalks-cache"))

Citations

Cite the organizations that produce the crosswalks returned by this package:

For NHGIS, see requirements at: https://www.nhgis.org/citation-and-use-nhgis-data

For Geocorr, a suggested citation (update the year):

Missouri Census Data Center, University of Missouri. (2022/2018). Geocorr 2022/2018: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022/2018.html

For CTData, a suggested citation (adjust for alternate source geography):

CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. Retrieved from: https://github.com/CT-Data-Collaborative/2022-tract-crosswalk.

For this package, refer here: https://ui-research.github.io/crosswalk/authors.html#citation
