This project is a set of R scripts that will automatically download, unzip, and mirror the Access database files from IPEDS to a DuckDB database.
Doing time series analysis with IPEDS data can be annoying. You have to download a file for each year you need. That can turn into multiple files per year if you are utilizing variables that are not stored in the same datafile. If you instead download the Access database files, you then have these large files that are not in a very flexible format.
DuckDB is a SQL database that is very flexible, compatible with most popular programming languages.
For 21 years (2004-2024) of data, the ZIP files sit at 1.24 GB, the unzipped Access files at over 10 GB, and the resulting DuckDB file at 2.2 GB.
This project uses just, but it really isn't needed.
just processYou can just as easily run the script directly with:
Rscript main.RThis uses a handful of R libraries.
install.packages(c(
"tidyverse",
"httr2",
"rvest",
"duckdb"
))Each file within the R directory contains only functions which are called by main.R.
check.R: Checks the IPEDS website, the any existing ZIP files, and any existing DuckDB database.download.R: Downloads any missing ZIP files.access.R: Unzip and create ODBC connections for each of the Access files.metadata.R: Pull some of the metadata from the Access files.mirror.R: Extract all of the data tables from the Access files and appends them across time, writing/appending them to the DuckDB file.
The main.R file loads the libraries, opens the DuckDB file, sources the files within R, and calls their functions.
The result is a database file, mirror.duckdb, which contains each table with all the years concatenated.
So for instance, the HD table will have all years and columns with an additional identifier of year.
The table names have the year portion of their name removed so the C2020A table is part of the C_A table.
The metadata is found in the tables, variables, and value_sets tables.
I've built this with some attempts at optimizing for speed, but I'm sure the process could be way faster.
(I haven't tried these projects. They are just what I found was a search on GitHub.)
paulgp/ipeds-database- Python, DuckDB
stanislavzza/IPEDSR- R package, DuckDB
- Seems to have the same general idea with a very different approach and different goals.
jbryer/ipeds- R package, last updated in 2023
{IPEDSuploadable}looks cool but is for submitting data to IPEDS, so a different objective.- There are others that haven't been updated in many years or seem like a one-off data project.