Build a research-ready unbalanced IPEDS institution-year panel from NCES IPEDS Access databases.
In plain terms, this repo turns the yearly IPEDS Access databases into one institution-by-year panel and leaves behind enough QA evidence that you can inspect the result instead of trusting a black box.
If you are opening this repo cold, start with this:
- the repo holds the code, docs, and small tracked artifacts
- `IPEDSDB_ROOT` holds the real downloads, outputs, and QA files
- `bash manual_commands.sh` is the normal full build entrypoint
- `bash Scripts/QA_QC/qc_only.sh` is the fastest way to check an existing build
This repository is code-first and data-outside-git by design:
- Unit of analysis: `UNITID` by `year`
- Default coverage in this repo: `2004:2023`
- Release policy: Final Access releases only
- Upstream input: annual IPEDS Access databases, not flat component files
- Canonical final output: `panel_clean_analysis_2004_2023.parquet`
| Goal | Start here |
|---|---|
| Run the whole pipeline | `bash manual_commands.sh` |
| Test the setup without a full historical build | `python Scripts/00_run_all.py --years "2022:2023" --run-cleaning --run-qaqc` |
| Check whether an existing build looks healthy | `bash Scripts/QA_QC/qc_only.sh` |
| Run the final acceptance audit only | `python Scripts/QA_QC/08_acceptance_audit.py --root "$IPEDSDB_ROOT" --years "2004:2023"` |
| Run saved inspection SQL and export results | `python Scripts/run_saved_query.py --list` |
| Browse variables that actually exist in the current panel | `python Scripts/10_build_variable_browser.py ...` |
| Pull only a subset of variables | `python Scripts/08_build_custom_panel.py ...` |
| Understand where a file came from | open `Checks/`, then `Dictionary/`, then `Raw_Access_Databases/<year>/metadata/` |
| Inspect what the repo is doing | `manual_commands.sh` -> `Scripts/00_run_all.py` -> stage scripts in `Scripts/01-09` |
| Item | Value |
|---|---|
| Repo path | `.../Documents/GitHub/IPEDSDB_Panel` |
| External data root | `.../Projects/IPEDSDB_Paneling` |
| Primary entrypoint | `bash manual_commands.sh` |
| SQL engine | DuckDB |
| Access extraction backend | mdb-tools |
| Main panel key | `UNITID`, `year` |
If the folder tree feels large, reduce it to this:
- the repo explains and runs the build
- `Raw_Access_Databases/` stages the yearly source material
- `Dictionary/` explains what the variables mean
- `Cross_sections/` holds the yearly long-form intermediate outputs
- `Panels/` holds the outputs most people actually want
- `Checks/` tells you whether the outputs are worth trusting
You do not need to read every script or every CSV to work effectively here. In most cases:
- to run the build: use `manual_commands.sh`
- to inspect a finished build: start in `Checks/`
- to understand the logic: read `Scripts/README.md`, then `Scripts/00_run_all.py`
For the current verified `2004:2023` build under `IPEDSDB_ROOT`, the released panel is structurally sound and QA-clean.
If you only need the headline:
- acceptance audit: `39 / 39` checks passed
- final clean panel: `141,711` rows and `1,864` columns
- year window: `2004` through `2023`
- key integrity: `0` duplicate `(UNITID, year)` rows
- cleaning preserved all institution-year rows
- dictionary QA reports `0` unresolved duplicate, conflict, or unmapped failures
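The key-integrity headline above is, at heart, a uniqueness check on `(UNITID, year)`. A minimal stdlib sketch of that check over toy rows (the real audit runs over the wide and clean parquet files, not this list):

```python
from collections import Counter

def duplicate_keys(rows):
    """Return (UNITID, year) keys that appear on more than one row."""
    counts = Counter((r["UNITID"], r["year"]) for r in rows)
    return [key for key, n in counts.items() if n > 1]

# Toy institution-year rows; a healthy panel yields an empty list.
rows = [
    {"UNITID": 100654, "year": 2022},
    {"UNITID": 100654, "year": 2023},
    {"UNITID": 100663, "year": 2023},
]
print(duplicate_keys(rows))  # → []
```

Any non-empty result here would correspond to a failed duplicate-key acceptance check.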
The status applies to the validated artifacts under `IPEDSDB_ROOT`, especially:

- `Panels/2004-2023/panel_long_varnum_2004_2023.parquet`
- `Panels/panel_wide_analysis_2004_2023.parquet`
- `Panels/panel_clean_analysis_2004_2023.parquet`
See the full release evidence and metrics
That release status is based on generated audit artifacts, not just on the fact that the code ran:

- `Checks/acceptance_qc/acceptance_summary.csv` and `acceptance_summary.md` pass
- `Checks/panel_qc/panel_qa_summary.csv` shows row preservation and zero suspicious flags
- `Checks/panel_qc/panel_structure_summary.csv` and `identifier_linkage_summary.csv` document unbalancedness and identifier continuity
- `Checks/panel_qc/component_timing_reference.csv`, `finance_comparability_summary.csv`, and `classification_stability_summary.csv` cover key comparability cautions
- `Checks/dictionary_qc/dictionary_qaqc_summary.csv` shows zero unresolved duplicate/conflict/unmapped dictionary failures
- `METHODS_PRCH_CLEANING.md` documents the parent-child cleaning method
- `METHODS_PANEL_CONSTRUCTION.md` documents the full panel-construction method
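A release gate over an audit summary CSV can be sketched as below. The column names (`check`, `status`) are illustrative assumptions, not the audit script's actual schema:

```python
import csv
import io

def all_pass(csv_text, status_col="status"):
    """True only if every audit row's status column reads PASS."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return all(row[status_col].strip().upper() == "PASS" for row in reader)

# Toy two-row summary in the assumed shape.
sample = "check,status\nyear_coverage,PASS\nduplicate_keys,PASS\n"
print(all_pass(sample))  # → True
```

In practice you would point a reader like this at `Checks/acceptance_qc/acceptance_summary.csv` and refuse to treat a rebuild as release-ready on any `FAIL`.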
| Metric | Current value | Why it matters |
|---|---|---|
| Acceptance audit | 39 / 39 checks passed | top-level release gate over the live generated artifacts |
| Repo tests | 44 passed | regression coverage over core build and QA paths |
| Final clean panel rows | 141,711 | confirms delivered panel size |
| Final clean panel columns | 1,864 | confirms delivered schema width |
| Year coverage | 20 years, 2004 through 2023 | confirms requested panel window |
| Distinct institutions | 10,421 UNITIDs | confirms panel population size |
| Always-present institutions | 4,395 | institutions observed across the full window |
| Intermittent-gap institutions | 146 | institutions with internal reporting gaps |
| Possible selection-risk institutions | 4,343 | makes entry, exit, and gap-related attrition visible |
| Identifier-linkage review cases | 973 UNITIDs with multiple observed OPEID values | flags continuity cases for review |
| Raw vs clean row preservation | 141,711 raw and 141,711 clean | confirms cleaning did not drop rows |
| Duplicate (UNITID, year) keys | 0 in raw wide, 0 in clean wide | confirms one-row-per-institution-year integrity |
| Long-panel key nulls | 0 null/blank year, UNITID, varnumber, or source_file | confirms stitched long-key integrity |
| PRCH flags evaluated | 15 observed flags | confirms panel QA covers the observed PRCH surface |
| Suspicious PRCH flags | 0 | confirms no flagged parent-child leakage remains |
| Dictionary lake rows | 66,702 | confirms stitched metadata coverage |
| Dictionary code-label rows | 208,339 | confirms category/code-label coverage |
| Dictionary duplicate/conflict/unmapped failures | 0 duplicate rows, 0 source-file conflicts, 0 varnumber collisions, 0 unmapped rows, 0 needs-review rows | confirms dictionary integrity |
| Discrete-conflict QA | 254 grouped rows, 0 high-signal groups | confirms remaining conflicts look like expected patterns |
These numbers come from the current generated QA artifacts and panel files under `IPEDSDB_ROOT`. If you rebuild the pipeline, rerun `bash Scripts/QA_QC/qc_only.sh` and refresh the acceptance artifacts before treating the new build as release-ready.
These references shape the repo's construction logic, QA design, and interpretation cautions. Most users do not need them on a first read.
If you want the repo's applied version, start with:
- `METHODS_PANEL_CONSTRUCTION.md`
- `METHODS_PRCH_CLEANING.md`
See the full reference list
- Jaquette, O., & Parra, E. E. (2014). Using IPEDS for Panel Analyses: Core Concepts, Data Challenges, and Empirical Applications. In M. B. Paulsen (Ed.), Higher Education: Handbook of Theory and Research (Vol. 29, pp. 467-533). Springer. https://doi.org/10.1007/978-94-017-8005-6_11
- Kelchen, R. (2019). Merging Data to Facilitate Analyses. New Directions for Institutional Research, 2019. https://doi.org/10.1002/ir.20298
- Jaquette, O., & Parra, E. (2016). The Problem with the Delta Cost Project Database. Research in Higher Education, 57(5), 630-651. https://doi.org/10.1007/s11162-015-9399-2
- Cheslock, J. J., & Shamekhi, Y. (2020). Decomposing financial inequality across U.S. higher education institutions. Economics of Education Review, 78, 102035. https://doi.org/10.1016/j.econedurev.2020.102035
- Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.
- Wooldridge, J. M. (2019). Correlated random effects models with unbalanced panels. Journal of Econometrics, 211(1), 137-150. https://doi.org/10.1016/j.jeconom.2018.12.010
- Baltagi, B. H. (2021). Econometric Analysis of Panel Data (6th ed.). Springer. https://doi.org/10.1007/978-3-030-53953-5
- NCES. IPEDS Access Databases. https://nces.ed.gov/ipeds/use-the-data/download-access-database
- NCES. Survey Components. https://nces.ed.gov/ipeds/survey-components
- NCES. Data Literacy/Data Use Training (DLDT). https://nces.ed.gov/ipeds/use-the-data/dldt
- NCES. Institutional Groupings. https://nces.ed.gov/ipeds/about-ipeds-data/institutional-groupings
- NCES. Reporting Finance Data for Multiple Institutions. https://nces.ed.gov/ipeds/report-your-data/data-tip-sheet-reporting-finance-data-multiple-institutions
Think of the split this way:
- the repo holds code, tests, notes, and small tracked reference files
- `IPEDSDB_ROOT` holds the real working data: downloads, extracted tables, parquet panels, QA summaries, and DuckDB build state
That keeps the git repo readable while still leaving a full local audit trail behind each run.
The orchestration path is:

1. download final Access archives and companion documentation
2. extract one Access database per year and export tables to CSV
3. build metadata dictionaries from Access metadata tables
4. harmonize yearly long files keyed by `UNITID`, `year`, `varnumber`, `source_file`
5. stitch the long panel
6. build the wide analysis panel in DuckDB
7. apply PRCH cleaning
8. emit QA/QC artifacts and optional custom outputs
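Steps 4 through 6 reduce to a long-to-wide pivot on `(UNITID, year)`. A toy stdlib sketch of that reshaping (the real build does this in DuckDB over parquet, with metadata attached):

```python
from collections import defaultdict

def pivot_wide(long_rows):
    """Collapse long (UNITID, year, varnumber, value) rows into one mapping per institution-year."""
    wide = defaultdict(dict)
    for r in long_rows:
        key = (r["UNITID"], r["year"])
        wide[key][r["varnumber"]] = r["value"]
    return {k: dict(v) for k, v in wide.items()}

# Two long observations for one institution-year become one wide row.
long_rows = [
    {"UNITID": 100654, "year": 2023, "varnumber": "ENRTOT", "value": 5200},
    {"UNITID": 100654, "year": 2023, "varnumber": "CONTROL", "value": 1},
]
print(pivot_wide(long_rows))
# → {(100654, 2023): {'ENRTOT': 5200, 'CONTROL': 1}}
```

The variable names (`ENRTOT`, `CONTROL`) are illustrative, not guaranteed column names in the delivered panel.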
The scripts create this structure automatically. The idea is simple: if something goes wrong, you should be able to open the local folders and see where the pipeline got to, what it wrote, and what it thought was healthy.
When you come back to this repo after a while, these five places usually get you oriented fastest:
- `README.md`
- `Scripts/README.md`
- `Scripts/00_run_all.py`
- `Checks/acceptance_qc/acceptance_summary.md`
- `Panels/panel_clean_analysis_2004_2023.parquet`
| Path | Purpose |
|---|---|
| `README.md` | operator guide |
| `manual_commands.sh` | one-command local run |
| `requirements.txt` | Python dependencies |
| `Scripts/` | pipeline stages and utilities |
| `Scripts/QA_QC/` | QA, parity, and repo guards |
| `Artifacts/` | small tracked reference files and guide figures |
| `Customize_Panel/selectedvars.txt` | starter variable list for custom extracts |
| `docs/` | GitHub Pages publish directory for the public variable-browser snapshot, landing page, and user guide |
| `Queries/` | saved starter SQL for DuckDB inspection |
| `tests/` | lightweight parser and metadata tests |
Set:

```bash
export IPEDSDB_ROOT="/Users/markjaysonfarol13/Projects/IPEDSDB_Paneling"
```

If `IPEDSDB_ROOT` is unset, the scripts default to that same path.
Top-level folders created under `IPEDSDB_ROOT`:

- `Raw_Access_Databases/`
- `Dictionary/`
- `Cross_sections/`
- `Panels/`
- `Checks/`
- `build/`
Before starting a long build, check these five things:
- You are inside the repo: `.../Documents/GitHub/IPEDSDB_Panel`
- The repo-local venv is active: `source .venv/bin/activate`
- `IPEDSDB_ROOT` points to the external local folder you want to use
- `mdb-tables`, `mdb-schema`, and `mdb-export` are available on `PATH`
- You are comfortable with the pipeline downloading and writing large files under `IPEDSDB_ROOT`
If you are unsure whether your environment is ready, the fastest honest check is:

```bash
bash Scripts/QA_QC/qc_only.sh
```

That does not rebuild the full panel, but it does confirm that the current generated artifacts are readable and that the QA layer still agrees with them.
```bash
cd /Users/markjaysonfarol13/Documents/GitHub/IPEDSDB_Panel
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
which mdb-tables mdb-schema mdb-export
```

The extraction stage will stop immediately if any of those binaries are missing.
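That fail-fast presence check can be sketched in Python with `shutil.which`; the helper name here is an illustration, not the extraction script's actual code:

```python
import shutil

def missing_binaries(names):
    """Return the subset of required executables not found on PATH."""
    return [name for name in names if shutil.which(name) is None]

required = ["mdb-tables", "mdb-schema", "mdb-export"]
missing = missing_binaries(required)
if missing:
    # Mirrors the pipeline's behavior of stopping before any extraction work.
    print(f"install mdb-tools first; missing: {missing}")
```

`shutil.which` respects the same `PATH` your shell uses, so this agrees with the `which` check above.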
```bash
cd /Users/markjaysonfarol13/Documents/GitHub/IPEDSDB_Panel
source .venv/bin/activate
export IPEDSDB_ROOT="/Users/markjaysonfarol13/Projects/IPEDSDB_Paneling"
bash manual_commands.sh
```

This wrapper is the normal "just run the whole thing" path. It:

- activates `.venv` if it exists
- checks `mdb-tools`
- creates `IPEDSDB_ROOT` if needed
- runs the full `2004:2023` build
- runs cleaning and QA
What to expect from a full run:
- the download stage writes one `manifest.csv` per year
- the extraction stage creates one CSV per Access table
- the dictionary and QA stages create many readable CSV summaries
- the largest final artifacts are parquet files in `Panels/`
- a full `2004:2023` run is materially heavier than a one-year smoke test
If you are wondering whether the build is "stuck," the best places to look are:

- the current terminal output
- `Checks/logs/`
- the newest files appearing under the active year in `Raw_Access_Databases/`
- the newest summary CSV written under `Checks/`
Use at least two years if you want Stage 07 cleaning and panel QA. The cleaner intentionally refuses a single-year input because the cleaned release product is meant to be a true panel, not a one-year slice.
```bash
cd /Users/markjaysonfarol13/Documents/GitHub/IPEDSDB_Panel
source .venv/bin/activate
export IPEDSDB_ROOT="/Users/markjaysonfarol13/Projects/IPEDSDB_Paneling"
python Scripts/00_run_all.py \
  --root "$IPEDSDB_ROOT" \
  --years "2022:2023" \
  --run-cleaning \
  --run-qaqc
```

If you only want to validate acquisition through wide-build on a single year, skip the cleaning layer:

```bash
python Scripts/00_run_all.py \
  --root "$IPEDSDB_ROOT" \
  --years "2023:2023"
```

To preview the planned stages without executing them, add `--dry-run`:

```bash
python Scripts/00_run_all.py \
  --root "$IPEDSDB_ROOT" \
  --years "2004:2023" \
  --run-cleaning \
  --run-qaqc \
  --dry-run
```

After a full run, the main files to inspect are:
```
$IPEDSDB_ROOT/Panels/2004-2023/panel_long_varnum_2004_2023.parquet
$IPEDSDB_ROOT/Panels/panel_wide_analysis_2004_2023.parquet
$IPEDSDB_ROOT/Panels/panel_clean_analysis_2004_2023.parquet
```
What each one means:
| File | Meaning |
|---|---|
| `panel_long_varnum_2004_2023.parquet` | stitched long panel at the variable-row level |
| `panel_wide_analysis_2004_2023.parquet` | wide analysis panel before PRCH cleaning |
| `panel_clean_analysis_2004_2023.parquet` | final cleaned analysis-ready panel |
Supporting outputs that are often useful during debugging:
```
$IPEDSDB_ROOT/Dictionary/dictionary_lake.parquet
$IPEDSDB_ROOT/Dictionary/dictionary_codes.parquet
$IPEDSDB_ROOT/Checks/download_qc/release_inventory.csv
$IPEDSDB_ROOT/Checks/extract_qc/table_inventory_all_years.csv
$IPEDSDB_ROOT/Checks/dictionary_qc/dictionary_qaqc_summary.csv
$IPEDSDB_ROOT/Checks/panel_qc/panel_qa_summary.csv
$IPEDSDB_ROOT/Checks/panel_qc/panel_qa_coverage_matrix.csv
$IPEDSDB_ROOT/Checks/acceptance_qc/acceptance_summary.csv
```
If you want the fastest sanity check after a run, open these first in roughly this order:
1. `Checks/download_qc/release_inventory.csv`
2. `Checks/extract_qc/table_inventory_all_years.csv`
3. `Checks/dictionary_qc/dictionary_qaqc_summary.csv`
4. `Checks/panel_qc/panel_qa_coverage_matrix.csv`
5. `Checks/acceptance_qc/acceptance_summary.md`
6. `Panels/panel_clean_analysis_2004_2023.parquet`
The wide-build stage persists a DuckDB build database here:
```
$IPEDSDB_ROOT/build/ipedsdb_build.duckdb
```
Use the saved-query runner when you want repeatable inspection results without having to remember where the DuckDB file lives or how the common artifact views are named.
List starter queries:

```bash
python Scripts/run_saved_query.py --list
```

Run one saved query:

```bash
python Scripts/run_saved_query.py 01_clean_panel_rows_by_year
```

What the query runner does:

- opens an in-memory DuckDB inspection session
- attaches the persisted build database when it exists
- exposes stable `inspect.*` views over the standard panel, dictionary, QA, and release-inventory artifacts
- writes a timestamped result folder under `Checks/query_results/`

Query-result folders contain:

- `result.csv` or `result.parquet`
- `query.sql`
- `query_run.json`
- `preview.txt`
If you use Data Wrangler, it is most useful on:
- `Checks/query_results/*/result.csv`
- QA CSV summaries in `Checks/`
- year-level metadata CSVs in `Raw_Access_Databases/<year>/metadata/`
It is not the main execution interface for this repo. Think of it as a convenience layer for inspection, not the source of truth for the build.
If you want a low-friction habit, use this order:
1. run saved SQL with `run_saved_query.py`
2. open the exported `result.csv`
3. use Data Wrangler on that CSV instead of pointing it at the full build directly
| Stage | Script | What it does | Main outputs |
|---|---|---|---|
| Download | `Scripts/01_download_access_databases.py` | Scrapes the NCES Access page and downloads final-only yearly archives plus companion workbooks | `Raw_Access_Databases/<year>/manifest.csv`, `Checks/download_qc/` |
| Extract | `Scripts/02_extract_access_db.py` | Unzips the Access DB, exports each table to CSV, and classifies tables | `tables_csv/`, `metadata/table_inventory.csv`, `metadata/table_columns.csv` |
| Dictionary | `Scripts/03_dictionary_ingest.py` | Builds dictionary lake and code-label tables from Access metadata | `Dictionary/dictionary_lake.parquet`, `Dictionary/dictionary_codes.parquet` |
| Harmonize | `Scripts/04_harmonize.py` | Converts exported data tables into long parquet with metadata attached | `Cross_sections/panel_long_varnum_<year>.parquet`, `Checks/harmonize_qc/` |
| Stitch | `Scripts/05_stitch_long.py` | Combines yearly long outputs into one stitched panel | `Panels/2004-2023/panel_long_varnum_2004_2023.parquet` |
| Wide build | `Scripts/06_build_wide_panel.py` | Uses DuckDB to build the wide analysis panel and related QC | `Panels/panel_wide_analysis_2004_2023.parquet`, `Checks/wide_qc/`, `Checks/disc_qc/` |
| Clean | `Scripts/07_clean_panel.py` | Applies PRCH child-row cleaning while preserving all `UNITID`-year rows | `Panels/panel_clean_analysis_2004_2023.parquet`, `Checks/prch_qc/` |
| Custom extract | `Scripts/08_build_custom_panel.py` | Creates a smaller panel with selected columns | custom `.parquet` or `.csv` |
| Panel dictionary | `Scripts/09_build_panel_dictionary.py` | Builds a dictionary tied to actual wide-panel columns | panel-level dictionary `.csv` or `.xlsx` |
The pipeline writes a lot of CSV on purpose. The goal is that you can answer “what happened?” by opening a few readable summaries instead of jumping straight into parquet inspection.
The practical reading order is:
1. `acceptance_qc/` for the top-level pass/fail decision
2. `panel_qc/` for row-preservation and PRCH behavior
3. `dictionary_qc/` for metadata and source-family issues
4. `wide_qc/` or `harmonize_qc/` only if one of the higher-level summaries points there
Most useful QA directories:
| Directory | What to inspect first |
|---|---|
| `Checks/download_qc/` | `release_inventory.csv`, `missing_years.csv`, `download_failures.csv` |
| `Checks/extract_qc/` | `table_inventory_all_years.csv`, `extract_failures.csv` |
| `Checks/dictionary_qc/` | `dictionary_qaqc_summary.csv`, `unmapped_metadata_tables.csv`, `noncanonical_source_categories.csv` |
| `Checks/harmonize_qc/` | yearly `harmonize_summary_*.csv`, dropped UNITID reports |
| `Checks/release_qc/` | yearly release summaries confirming Final releases |
| `Checks/wide_qc/` | scalar-conflict and wide-build reports |
| `Checks/disc_qc/` | `disc_conflicts_summary_all_years.csv` first, then year-level detail only if needed |
| `Checks/prch_qc/` | `prch_clean_summary.csv`, `prch_clean_columns.csv`, `prch_flag_policy.csv` |
| `Checks/panel_qc/` | `panel_qa_summary.csv`, `panel_qa_coverage_matrix.csv`, `panel_structure_summary.csv`, `identifier_linkage_summary.csv`, `classification_stability_summary.csv` |
| `Checks/acceptance_qc/` | `acceptance_summary.csv`, `acceptance_summary.md` |
| `Checks/query_results/` | saved-query outputs for inspection and Data Wrangler |
| `Checks/real_parity_runs/summary/` | cross-run task-monitor CSV and Markdown summaries |
Run QA only against existing outputs:

```bash
bash Scripts/QA_QC/qc_only.sh
```

That wrapper now runs:

- `00_dictionary_qaqc.py`
- `01_panel_qa.py`
- `09_panel_structure_qc.py`
- `08_acceptance_audit.py`
The acceptance audit is the final “is this build release-ready?” check over the live generated artifacts under IPEDSDB_ROOT.
Run it directly:

```bash
python Scripts/QA_QC/08_acceptance_audit.py \
  --root "$IPEDSDB_ROOT" \
  --years "2004:2023"
```

It writes:

```
$IPEDSDB_ROOT/Checks/acceptance_qc/acceptance_summary.csv
$IPEDSDB_ROOT/Checks/acceptance_qc/acceptance_summary.md
```
It checks:
- required panel, dictionary, and QA artifacts exist
- exact `2004:2023` year coverage
- no duplicate `(UNITID, year)` keys in wide or clean outputs
- long-panel key fields are non-null and non-blank
- raw and clean row counts match
- dictionary QA has no unresolved duplicate/conflict/unmapped rows
- panel QA has no suspicious PRCH flags
- discrete-conflict QA has no remaining high-signal groups
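Each bullet above is an independent named check that lands as one pass/fail row in the summary. A simplified stdlib sketch of that pattern over toy rows (the check names and row shapes are illustrative, not the audit script's actual schema):

```python
def run_checks(panel_rows, expected_years):
    """Evaluate named boolean checks and return audit-style (name, status) rows."""
    keys = [(r["UNITID"], r["year"]) for r in panel_rows]
    checks = {
        "year_coverage": {r["year"] for r in panel_rows} == expected_years,
        "no_duplicate_keys": len(keys) == len(set(keys)),
        "no_null_unitid": all(r["UNITID"] is not None for r in panel_rows),
    }
    return [(name, "PASS" if ok else "FAIL") for name, ok in checks.items()]

rows = [{"UNITID": 100654, "year": 2022}, {"UNITID": 100654, "year": 2023}]
print(run_checks(rows, {2022, 2023}))
# → [('year_coverage', 'PASS'), ('no_duplicate_keys', 'PASS'), ('no_null_unitid', 'PASS')]
```

The real audit evaluates 39 such checks and writes the rows to `acceptance_summary.csv` and `acceptance_summary.md`.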
The authoritative method note for parent-child cleaning is `METHODS_PRCH_CLEANING.md`.

The authoritative full-construction method note is `METHODS_PANEL_CONSTRUCTION.md`.
Current repo policy is intentionally component-specific:
- keep every `UNITID`-year row
- null only the component-family columns affected by a `PRCH_*` flag
- for Finance, clean `PRCH_F` codes `2`, `3`, `4`, `5`
- retain `PRCH_F = 6` as a review-only partial case because blanket nulling would erase valid reported finance values
The important practical point is that the cleaned panel is row-preserving, not institution-collapsing.
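That row-preserving policy can be sketched as: keep the row, null only the flagged component family's columns. The column name `F_TOTREV` and the helper below are illustrative assumptions, not the cleaner's actual code:

```python
CLEAN_PRCH_F_CODES = {2, 3, 4, 5}  # code 6 is retained as review-only, per repo policy

def clean_finance(row, finance_cols):
    """Null finance columns when PRCH_F marks a child record; never drop the row."""
    cleaned = dict(row)
    if cleaned.get("PRCH_F") in CLEAN_PRCH_F_CODES:
        for col in finance_cols:
            cleaned[col] = None  # component-specific nulling, row preserved
    return cleaned

row = {"UNITID": 100654, "year": 2023, "PRCH_F": 3, "F_TOTREV": 1_000_000}
print(clean_finance(row, ["F_TOTREV"]))
# → {'UNITID': 100654, 'year': 2023, 'PRCH_F': 3, 'F_TOTREV': None}
```

Note that the institution-year row survives with its identifiers intact; only the affected finance values are blanked.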
| Signal | What you want to see |
|---|---|
| Release coverage | requested years exist and are marked Final |
| Extraction | one Access DB per year and a non-empty `table_inventory.csv` |
| Dictionary | low duplicate/conflict counts and a mostly categorized noncanonical source report |
| Harmonization | no fatal UNITID issues and expected yearly summaries |
| Wide build | `panel_wide_analysis_2004_2023.parquet` exists and QA files are written |
| Final clean panel | `panel_clean_analysis_2004_2023.parquet` exists, `panel_qa_summary.csv` shows row preservation, and `panel_qa_coverage_matrix.csv` has no unexplained suspicious flags |
| Acceptance audit | `Checks/acceptance_qc/acceptance_summary.csv` is all PASS |
You should not need to open every QA directory when a run looks healthy. In the normal case, acceptance_qc/ and panel_qc/ are enough to tell you whether deeper inspection is necessary.
For structure-sensitive work, the most informative new files are:
- `Checks/panel_qc/panel_structure_summary.csv`
- `Checks/panel_qc/entry_exit_gap_summary.csv`
- `Checks/panel_qc/identifier_linkage_summary.csv`
- `Checks/panel_qc/classification_stability_summary.csv`
- `Checks/panel_qc/finance_comparability_summary.csv`
Check these in order:
1. terminal output from the failing script
2. `Checks/download_qc/download_failures.csv`
3. `Checks/extract_qc/extract_failures.csv`
4. `Checks/dictionary_qc/dictionary_qaqc_summary.csv`
5. `Checks/harmonize_qc/`
6. `Checks/wide_qc/`
7. `Checks/panel_qc/panel_qa_coverage_matrix.csv`
Common failure patterns:
| Problem | Likely place to look |
|---|---|
| download failed | network access, NCES page changes, `download_failures.csv` |
| extraction failed | mdb-tools, malformed zip, `extract_failures.csv` |
| missing metadata roles | yearly `metadata/table_inventory.csv` |
| missing `UNITID` fatal error | exported CSV table in `Raw_Access_Databases/<year>/tables_csv/` |
| weird wide-panel behavior | `Checks/wide_qc/`, `Checks/disc_qc/disc_conflicts_summary_all_years.csv`, dictionary mapping |
If you feel lost, go back to the last stage that clearly succeeded and open that stage’s summary CSV before diving into lower-level files.
```bash
python Scripts/08_build_custom_panel.py \
  --input "$IPEDSDB_ROOT/Panels/panel_clean_analysis_2004_2023.parquet" \
  --output "$IPEDSDB_ROOT/Panels/custom_panel_2004_2023.parquet" \
  --vars-file "Customize_Panel/selectedvars.txt" \
  --years "2004:2023"
```

Build the static variable browser with:

```bash
python3 Scripts/10_build_variable_browser.py \
  --input "$IPEDSDB_ROOT/Panels/panel_clean_analysis_2004_2023.parquet" \
  --dictionary "$IPEDSDB_ROOT/Dictionary/dictionary_lake.parquet" \
  --output "Customize_Panel/variable_browser.html"
```

That writes a single static HTML file you can open locally. The browser lists only columns that actually exist in the supplied panel schema, groups them into analyst-facing semantic families using dictionary metadata plus lightweight heuristics, surfaces coverage and completeness badges, and lets you:
- search and filter only real panel columns
- inspect descriptions, coverage, and related variables
- save, import, and reuse variable lists with validation
- export `selectedvars.txt` and, if needed, a small manifest
```bash
python Scripts/09_build_panel_dictionary.py \
  --input "$IPEDSDB_ROOT/Panels/panel_clean_analysis_2004_2023.parquet" \
  --dictionary "$IPEDSDB_ROOT/Dictionary/dictionary_lake.parquet" \
  --output "$IPEDSDB_ROOT/Panels/panel_clean_analysis_2004_2023_dictionary.csv"
```

For a formatted Excel workbook instead of CSV:

```bash
python Scripts/09_build_panel_dictionary.py \
  --input "$IPEDSDB_ROOT/Panels/panel_clean_analysis_2004_2023.parquet" \
  --dictionary "$IPEDSDB_ROOT/Dictionary/dictionary_lake.parquet" \
  --output "$IPEDSDB_ROOT/Panels/panel_clean_analysis_2004_2023_dictionary.xlsx"
```

Run the repo guards:

```bash
python Scripts/QA_QC/05_repo_size_guard.py
python Scripts/QA_QC/06_staged_repo_guard.py
```

Run the test suite:

```bash
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python -m pytest -q
```

Use that exact command in this environment. Plain `python -m pytest` can hang during plugin autoload before test collection begins.
```bash
python Scripts/QA_QC/03_monitored_analysis_build.py \
  --input "$IPEDSDB_ROOT/Panels/2004-2023/panel_long_varnum_2004_2023.parquet" \
  --dictionary "$IPEDSDB_ROOT/Dictionary/dictionary_lake.parquet"
```

That workflow now refreshes:

```
$IPEDSDB_ROOT/Checks/real_parity_runs/summary/task_monitor_summary.csv
$IPEDSDB_ROOT/Checks/real_parity_runs/summary/task_monitor_summary.md
```
| Term | Meaning in this repo |
|---|---|
| Access DB | the yearly NCES IPEDS Microsoft Access database |
| Dictionary lake | the stitched metadata reference built from Access metadata tables |
| Long panel | one row per UNITID-year-variable style observation |
| Wide panel | one row per UNITID-year with variables as columns |
| PRCH cleaning | parent-child handling that nulls affected component-family columns without dropping rows |
| `source_file` | normalized survey-family label used across harmonization and wide build |
| smoke test | a small run, typically one year, used to verify setup before a full build |
- This repo is currently configured around `2004:2023`.
- The workflow is Final release only. Provisional releases are intentionally excluded from the default build.
- `UNITID` and `year` are treated as the panel keys.
- Access extraction uses `mdb-tools`; there is no silent fallback backend.
- Generated data should not be committed to git.
- No script in this repo performs a git commit or push.
This repo is:
- a reproducible panel-construction pipeline
- an audit trail over the generated outputs
- a research-facing cleaned panel build with documented parent-child handling
This repo is not:
- a notebook-first exploration project
- a universal merger-history engine
- a substitute for reading the QA outputs
- a promise that every future rebuild is healthy unless you rerun the acceptance checks
If you are new to the repo, this order is the fastest way to get oriented without bouncing around:
1. `README.md`
2. `manual_commands.sh`
3. `Scripts/00_run_all.py`
4. `Scripts/01_download_access_databases.py`
5. `Scripts/02_extract_access_db.py`
6. `Scripts/03_dictionary_ingest.py`
7. `Scripts/04_harmonize.py`
8. `Scripts/06_build_wide_panel.py`
That path mirrors the actual data flow and gets you from acquisition to the final cleaned panel with the least context switching.