An R package for translating data across space and time.
This package provides a consistent API and standardized crosswalks so that the same approach works across different geography and year combinations. It also facilitates interpolation, that is, weighting source geography/year values by their crosswalk allocation factors and translating them to the desired target geography/year, and it reports diagnostics on the joins between source data and crosswalks.
The package sources crosswalks from:
- Geocorr (Missouri Census Data Center) - for inter-geography crosswalks (same-decade)
- IPUMS NHGIS - for inter-temporal crosswalks (across decades)
- CT Data Collaborative - for Connecticut 2020→2022 crosswalks (planning region changes)

Key features:

- Programmatic access: No more manual downloads from web interfaces; data is cached for speed
- Standardized output: Consistent column names across all crosswalk sources
- Metadata tracking: Full provenance of crosswalks stored as attributes
- Crosswalk chaining: Automatic chaining when multiple crosswalks are required
# Install from GitHub
renv::install("UI-Research/crosswalk")

First, we obtain a crosswalk and apply it to our data:
library(crosswalk)
library(dplyr)
library(ggplot2)
library(stringr)
library(sf)
library(tidycensus)
library(tigris)
library(scales)
source_data = get_acs(
    year = 2023,
    geography = "zcta",
    output = "wide",
    variables = c(below_poverty_level = "B17001_002")) %>%
  select(
    source_geoid = GEOID,
    count_below_poverty_level = below_poverty_levelE)
# Get a crosswalk from ZCTAs to PUMAs (same year; uses Geocorr 2022)
zcta_puma_crosswalk <- get_crosswalk(
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")
# Apply the crosswalk to your data
crosswalked_data <- crosswalk_data(
  data = source_data,
  crosswalk = zcta_puma_crosswalk)
## Or in a single step
crosswalked_data = crosswalk_data(
  data = source_data,
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")

What do the crosswalk(s) reflect, and how were they sourced?
## and there's more (not shown)
names(attr(crosswalked_data, "crosswalk_metadata")) %>% head()
#> [1] "call_parameters"       "data_source"           "data_source_full_name"
#> [4] "download_url"          "api_endpoint"          "documentation_url"

How well did the crosswalk join to our source data?
## look at all the characteristics of the join(s) between the source data
## and the crosswalks
join_quality = attr(crosswalked_data, "join_quality")
## what share of records in the source data do not join to a crosswalk and
## thus are dropped during the crosswalking process?
join_quality$pct_data_unmatched
#> [1] 0.4234277
## zctas aren't nested within states, otherwise join_quality$state_analysis_data
## would help us to ID whether non-joining source data were clustered within one
## or a few states. instead we can join to spatial data to diagnose further:
zctas_sf = zctas(year = 2023, progress_bar = FALSE)
states_sf = states(year = 2023, cb = TRUE, progress_bar = FALSE)
## apart from DC, which has a disproportionate number of non-joining ZCTAs--
## seemingly corresponding to federal areas and buildings--the distribution of
## non-joining ZCTAs appears proportionate to state-level populations and is
## distributed across many states:
zctas_sf %>%
  filter(GEOID20 %in% join_quality$data_geoids_unmatched) %>%
  st_intersection(states_sf %>% select(NAME)) %>%
  st_drop_geometry() %>%
  count(NAME, sort = TRUE) %>%
  head()
#>                   NAME  n
#> 1 District of Columbia 19
#> 2             New York 15
#> 3                Texas  9
#> 4           California  8
#> 5             Colorado  6
#> 6                 Utah  6

And how accurate was the crosswalking process?
comparison_data = get_acs(
    year = 2023,
    geography = "puma",
    output = "wide",
    variables = c(
      below_poverty_level = "B17001_002")) %>%
  select(
    source_geoid = GEOID,
    count_below_poverty_level_acs = below_poverty_levelE)
combined_data = left_join(
  comparison_data,
  crosswalked_data,
  by = c("source_geoid" = "geoid"))
combined_data %>%
  select(source_geoid, matches("count")) %>%
  mutate(
    difference_percent =
      (count_below_poverty_level_acs - count_below_poverty_level) /
      count_below_poverty_level_acs) %>%
  ggplot() +
  geom_histogram(aes(x = difference_percent)) +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  scale_x_continuous(labels = percent) +
  labs(
    title = "Crosswalked data approximates observed values",
    subtitle = "Block group-level source data would produce more accurate crosswalked values",
    y = "",
    x = "Percent difference between observed and crosswalked values")

The package has two main functions. You can also specify the needed
crosswalk(s) directly in crosswalk_data() and omit the intermediate
get_crosswalk() call.
| Function | Purpose |
|---|---|
| get_crosswalk() | Fetch crosswalk(s) |
| crosswalk_data() | Apply crosswalk(s) to interpolate data to the target geography-year |
get_crosswalk() always returns a list with three elements:

| Element | Description |
|---|---|
| crosswalks | A named list of crosswalks (step_1, step_2, etc.) |
| plan | Details about which crosswalks are being fetched |
| message | A description of the crosswalk chain |
For some source year/geography -> target year/geography combinations, no single direct crosswalk exists, and two crosswalks are needed. The package automatically plans and fetches the required crosswalks:
- Step 1 (NHGIS): Change year, keep geography constant
- Step 2 (Geocorr): Change geography at target year
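
To make the arithmetic behind a two-step chain concrete, here is a minimal base R sketch. It assumes that chaining is equivalent to multiplying the allocation factors along each path and summing over the intermediate geography; the package may instead apply each step in sequence, and its internals are not shown here. The toy crosswalks and the helper column names (mid_geoid, factor_1, factor_2) are hypothetical.

```r
# Hypothetical toy crosswalks; real ones come from get_crosswalk().
# Step 1: 2010 tracts (A, B) -> 2020 tracts (X, Y)
step_1 <- data.frame(
  source_geoid = c("A", "A", "B"),
  mid_geoid    = c("X", "Y", "Y"),
  factor_1     = c(0.6, 0.4, 1.0)
)
# Step 2: 2020 tracts (X, Y) -> 2020 ZCTAs (Z1, Z2)
step_2 <- data.frame(
  mid_geoid    = c("X", "Y", "Y"),
  target_geoid = c("Z1", "Z1", "Z2"),
  factor_2     = c(1.0, 0.5, 0.5)
)

# Join the two steps on the intermediate geography
chained <- merge(step_1, step_2, by = "mid_geoid")

# The share of a source unit reaching a target along one path is the
# product of the two factors
chained$allocation_factor_source_to_target <-
  chained$factor_1 * chained$factor_2

# Sum over paths that connect the same source and target
combined <- aggregate(
  allocation_factor_source_to_target ~ source_geoid + target_geoid,
  data = chained,
  FUN = sum
)
combined
```

In this toy case each source unit's combined factors still sum to 1, because both steps allocate every unit fully.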
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population",
  silent = TRUE)
# Two crosswalks are returned
# Step 1: 2010 tracts -> 2020 tracts (NHGIS)
# Step 2: 2020 tracts -> 2020 ZCTAs (Geocorr)

Each crosswalk contains standardized columns:
| Column | Description |
|---|---|
| source_geoid | Identifier for source geography |
| target_geoid | Identifier for target geography |
| allocation_factor_source_to_target | Weight for interpolating values |
| weighting_factor | Attribute used for weighting (population, housing, land) |
Additional columns may include source_year, target_year,
population_2020, housing_2020, and land_area_sqmi depending on the
source of the crosswalk.
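
Because the column names are standardized, simple sanity checks can be written once and reused across crosswalks. The base R sketch below flags source units whose allocation factors do not sum to roughly 1. Treating 1 as the expected total is an assumption: it holds when every source unit is fully allocated, which may not be true of every crosswalk this package returns. The toy data and the tolerance are hypothetical.

```r
# Hypothetical crosswalk using the standardized column names
crosswalk <- data.frame(
  source_geoid = c("A", "A", "B", "C", "C"),
  target_geoid = c("X", "Y", "Y", "Y", "Z"),
  allocation_factor_source_to_target = c(0.6, 0.4, 1.0, 0.7, 0.2)
)

# Total allocation factor per source unit
factor_totals <- tapply(
  crosswalk$allocation_factor_source_to_target,
  crosswalk$source_geoid,
  sum
)

# Source units whose factors do not sum to 1 (within a small tolerance)
incomplete <- names(factor_totals)[abs(factor_totals - 1) > 1e-6]
incomplete
```

Here only source unit C is flagged, since its factors sum to 0.9.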
Each crosswalk tibble has a crosswalk_metadata attribute that
documents what the crosswalk represents and how it was created:
metadata <- attr(result$crosswalks$step_1, "crosswalk_metadata")
names(metadata)

crosswalk_data() applies crosswalk weights to transform your data. If
you’re in a hurry, you can skip the call to get_crosswalk() and pass
the needed crosswalk parameters to crosswalk_data(), which forwards
them to get_crosswalk() behind the scenes. Alternatively, call
get_crosswalk() explicitly and pass the result to crosswalk_data().
The function auto-detects columns based on prefixes:
| Prefix | Treatment |
|---|---|
| count_ | Summed after weighting (for counts like population, housing units) |
| mean_, median_, percent_, ratio_ | Weighted mean (for rates, percentages, averages) |
You can also specify columns explicitly via count_columns and
non_count_columns. All non-count variables are interpolated using
weighted means, weighting by the allocation factor from the crosswalk.
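
The two treatments can be illustrated with a small base R sketch. This is not the package's implementation, only a hand-rolled approximation of the behavior described above: count columns are multiplied by the allocation factor and summed per target, while non-count columns are averaged using the allocation factor as the weight. The toy data are hypothetical.

```r
# Hypothetical source data with one count and one non-count column
source_data <- data.frame(
  source_geoid     = c("A", "B"),
  count_population = c(100, 200),
  percent_poverty  = c(10, 20)
)
# Hypothetical crosswalk using the standardized column names
crosswalk <- data.frame(
  source_geoid = c("A", "A", "B"),
  target_geoid = c("X", "Y", "Y"),
  allocation_factor_source_to_target = c(0.6, 0.4, 1.0)
)

# Attach the crosswalk rows to each source record
joined <- merge(source_data, crosswalk, by = "source_geoid")

# Summarize each target geography separately
result <- do.call(rbind, lapply(
  split(joined, joined$target_geoid),
  function(rows) {
    w <- rows$allocation_factor_source_to_target
    data.frame(
      target_geoid     = rows$target_geoid[1],
      # counts: weight, then sum
      count_population = sum(rows$count_population * w),
      # non-counts: mean weighted by the allocation factor
      percent_poverty  = weighted.mean(rows$percent_poverty, w)
    )
  }
))
result
```

Target Y receives 40 people from A and all 200 from B, so its count is 240, while its poverty percentage is pulled toward B's value because B contributes the larger weight.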
get_available_crosswalks() returns a listing of all supported
year-geography combinations.
get_available_crosswalks() %>%
  head()
#> # A tibble: 6 × 4
#>   source_geography target_geography source_year target_year
#>   <chr>            <chr>                  <int>       <int>
#> 1 block            block                   1990        2010
#> 2 block            block                   2000        2010
#> 3 block            block                   2010        2020
#> 4 block            block                   2020        2010
#> 5 block            block                   2020        2022
#> 6 block            block                   2022        2020

NHGIS crosswalks require an IPUMS API key. Get one at
https://account.ipums.org/api_keys and add it to your .Renviron:
usethis::edit_r_environ()
# Add: IPUMS_API_KEY=your_key_here

Use the cache parameter to save crosswalks locally for reuse:
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population",
  cache = here::here("crosswalks-cache"))

Cite the organizations that produce the crosswalks returned by this package:
For NHGIS, see requirements at: https://www.nhgis.org/citation-and-use-nhgis-data
For Geocorr, a suggested citation (update the year):
Missouri Census Data Center, University of Missouri. (2022/2018). Geocorr 2022/2018: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022/2018.html
For CTData, a suggested citation (adjust for alternate source geography):
CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. Retrieved from: https://github.com/CT-Data-Collaborative/2022-tract-crosswalk.
For this package, refer here: https://ui-research.github.io/crosswalk/authors.html#citation
