Skip to content

Steven314/ipeds-mirror

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

IPEDS Mirror

This project is a set of R scripts that will automatically download, unzip, and mirror the Access database files from IPEDS to a DuckDB database.

Why

Doing time series analysis with IPEDS data can be annoying. You have to download a file for each year you need. That can turn into multiple files per year if you are utilizing variables that are not stored in the same datafile. If you instead download the Access database files, you then have these large files that are not in a very flexible format.

DuckDB is a SQL database that is very flexible, compatible with most popular programming languages.

For 21 years (2004-2024) of data, the ZIP files sit at 1.24 GB, the unzipped Access files at over 10 GB, and the resulting DuckDB file at 2.2 GB.

Usage

This project uses just, but it really isn't needed.

just process

You can just as easily run the script directly with:

Rscript main.R

Requirements

This uses a handful of R libraries.

install.packages(c(
    "tidyverse",
    "httr2",
    "rvest",
    "duckdb"
))

Internals

Each file within the R directory contains only functions which are called by main.R.

  • check.R: Checks the IPEDS website, the any existing ZIP files, and any existing DuckDB database.
  • download.R: Downloads any missing ZIP files.
  • access.R: Unzip and create ODBC connections for each of the Access files.
  • metadata.R: Pull some of the metadata from the Access files.
  • mirror.R: Extract all of the data tables from the Access files and appends them across time, writing/appending them to the DuckDB file.

The main.R file loads the libraries, opens the DuckDB file, sources the files within R, and calls their functions.

The result is a database file, mirror.duckdb, which contains each table with all the years concatenated. So for instance, the HD table will have all years and columns with an additional identifier of year. The table names have the year portion of their name removed so the C2020A table is part of the C_A table. The metadata is found in the tables, variables, and value_sets tables.

I've built this with some attempts at optimizing for speed, but I'm sure the process could be way faster.

Similar Projects

(I haven't tried these projects. They are just what I found was a search on GitHub.)

About

A set of scripts to ease collection of IPEDS data.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors