IPEDS Mirror

This project is a set of R scripts that will automatically download, unzip, and mirror the Access database files from IPEDS to a DuckDB database.

Why

Doing time series analysis with IPEDS data can be annoying. You have to download a file for each year you need. That can turn into multiple files per year if you are utilizing variables that are not stored in the same datafile. If you instead download the Access database files, you then have these large files that are not in a very flexible format.

DuckDB is a SQL database that is very flexible, compatible with most popular programming languages.

For 21 years (2004-2024) of data, the ZIP files sit at 1.24 GB, the unzipped Access files at over 10 GB, and the resulting DuckDB file at 2.2 GB.

Usage

This project uses just, but it really isn't needed.

just process

You can just as easily run the script directly with:

Rscript main.R

Requirements

This uses a handful of R libraries.

install.packages(c(
    "tidyverse",
    "httr2",
    "rvest",
    "duckdb"
))

Internals

Each file within the R directory contains only functions which are called by main.R.

check.R: Checks the IPEDS website, the any existing ZIP files, and any existing DuckDB database.
download.R: Downloads any missing ZIP files.
access.R: Unzip and create ODBC connections for each of the Access files.
metadata.R: Pull some of the metadata from the Access files.
mirror.R: Extract all of the data tables from the Access files and appends them across time, writing/appending them to the DuckDB file.

The main.R file loads the libraries, opens the DuckDB file, sources the files within R, and calls their functions.

The result is a database file, mirror.duckdb, which contains each table with all the years concatenated. So for instance, the HD table will have all years and columns with an additional identifier of year. The table names have the year portion of their name removed so the C2020A table is part of the C_A table. The metadata is found in the tables, variables, and value_sets tables.

I've built this with some attempts at optimizing for speed, but I'm sure the process could be way faster.

Similar Projects

(I haven't tried these projects. They are just what I found was a search on GitHub.)

paulgp/ipeds-database
- Python, DuckDB
stanislavzza/IPEDSR
- R package, DuckDB
- Seems to have the same general idea with a very different approach and different goals.
jbryer/ipeds
- R package, last updated in 2023
{IPEDSuploadable} looks cool but is for submitting data to IPEDS, so a different objective.
There are others that haven't been updated in many years or seem like a one-off data project.
- GitHub Search

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
R		R
.gitignore		.gitignore
README.md		README.md
air.toml		air.toml
ipeds-mirror.Rproj		ipeds-mirror.Rproj
justfile		justfile
main.R		main.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IPEDS Mirror

Why

Usage

Requirements

Internals

Similar Projects

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

IPEDS Mirror

Why

Usage

Requirements

Internals

Similar Projects

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages