Utilities for converting a quasi-REDCap data dictionary into a workbook that REDCap will import without warnings, including automated column normalisation, linting, LLM-assisted metadata inference, and DSL-based replay of fixes.
- Dev Container: Open in VS Code and choose Reopen in Container to get Python 3.11 with requirements preinstalled.
- Local Python: Use Python 3.11+, then install dependencies once:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Minimal pipeline (run inside a scratch dir):
mkdir -p work && cd work
cp ../OriginalDict.xlsx .
python ../map.py OriginalDict.xlsx --out map.json
# inspect map.json, add immediates, set ignore flags, then continue
python ../reformat.py OriginalDict.xlsx --map map.json --out stage1.ops
python ../rcmod.py --in OriginalDict.xlsx --out Stage1Dict.xlsx stage1.ops
python ../redcap_lint.py Stage1Dict.xlsx --report lint.json || true
python ../llm_submit.py --config ../job_infer.json \
    --source lint.json --io-dir . --key-file ~/.config/openai.key
python ../fix.py --dict Stage1Dict.xlsx --report lint+.json \
    --output stage2.ops
python ../rcmod.py --in Stage1Dict.xlsx \
    --out FinalDict.xlsx stage2.ops
- map.py --out ...
  - Input: raw XLS/XLSX (multi-sheet allowed)
  - Output: map.json detailing sheet mappings, required field gaps, and optional constants (immediate).
- reformat.py --map ... --out ...
  - Input: original dictionary + curated map.json
  - Output: stage1.ops DSL script that reproduces the mapping.
- rcmod.py --in ... stage1.ops
  - Input: original dictionary + DSL
  - Output: single-sheet Stage1Dict.xlsx with canonical columns.
- redcap_lint.py --report lint.json
  - Input: Stage1Dict.xlsx
  - Output: structured lint findings (lint.json) and exit code 2 on violations.
- llm_submit.py --config job_infer.json
  - Input: lint.json (source payload) plus prompt/reference files listed in job_infer.json
  - Output: augmented lint (lint+.json) containing inferred field types and configurations.
- fix.py --dict Stage1Dict.xlsx --report lint+.json
  - Input: stage-one dictionary + augmented lint
  - Output: stage2.ops DSL with content-level fixes.
- rcmod.py --in Stage1Dict.xlsx stage2.ops
  - Input: stage-one dictionary + content DSL
  - Output: FinalDict.xlsx ready for import.
- Optional: llm_submit.py --config job_summary.json --source stage2.ops to generate a Markdown change summary for authors and ingest staff.
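The stages above can also be driven from Python rather than a shell. The following is a minimal sketch, not part of the repository: `build_stages`, `run_pipeline`, and the `scripts_dir` parameter are hypothetical names, and it assumes map.json has already been curated by hand and that an API key is available to llm_submit.py through the environment. Unlike `|| true`, it tolerates only the documented lint exit code 2, so any other linter failure still stops the run.

```python
import subprocess
import sys
from pathlib import Path

LINT_VIOLATIONS = 2  # exit code redcap_lint.py reports when it finds violations


def build_stages(raw, workdir, final="FinalDict.xlsx", scripts_dir="."):
    """Return (argv, accepted_exit_codes) pairs for the stages after map curation."""
    work, scripts = Path(workdir), Path(scripts_dir)

    def p(name):      # file inside the scratch directory
        return str(work / name)

    def tool(name):   # script or job config shipped with the repository
        return str(scripts / name)

    return [
        ([tool("reformat.py"), raw, "--map", p("map.json"), "--out", p("stage1.ops")], {0}),
        ([tool("rcmod.py"), "--in", raw, "--out", p("Stage1Dict.xlsx"), p("stage1.ops")], {0}),
        ([tool("redcap_lint.py"), p("Stage1Dict.xlsx"), "--report", p("lint.json")],
         {0, LINT_VIOLATIONS}),
        ([tool("llm_submit.py"), "--config", tool("job_infer.json"),
          "--source", p("lint.json"), "--io-dir", str(work)], {0}),
        ([tool("fix.py"), "--dict", p("Stage1Dict.xlsx"), "--report", p("lint+.json"),
          "--output", p("stage2.ops")], {0}),
        ([tool("rcmod.py"), "--in", p("Stage1Dict.xlsx"), "--out", final, p("stage2.ops")], {0}),
    ]


def run_pipeline(stages, python=sys.executable):
    """Run each stage in order and stop at the first unexpected exit code."""
    for argv, accepted in stages:
        result = subprocess.run([python, *argv])
        if result.returncode not in accepted:
            raise SystemExit(f"{argv[0]} failed with exit code {result.returncode}")
```

Calling `run_pipeline(build_stages("OriginalDict.xlsx", "work", scripts_dir=".."))` would then mirror the shell sequence from reformat.py onward.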
End-to-end example:
python map.py Raw.xlsx --out tmp/map.json
python reformat.py Raw.xlsx --map tmp/map.json --out tmp/structure.ops
python rcmod.py --in Raw.xlsx --out tmp/Stage1.xlsx tmp/structure.ops
python redcap_lint.py tmp/Stage1.xlsx --report tmp/lint.json || true
python llm_submit.py --config job_infer.json --source tmp/lint.json \
--io-dir tmp --key-env OPENAI_API_KEY
python fix.py --dict tmp/Stage1.xlsx --report tmp/lint+.json \
--output tmp/content.ops
python rcmod.py --in tmp/Stage1.xlsx \
--out Final.xlsx tmp/content.ops
map.py scans a quasi-REDCap dictionary and produces the JSON map consumed by later steps.
usage: map.py [-h] [--out OUT_FILE] [--default-immediate CANON=VALUE]
              dict_file
Generate raw→canonical column maps
positional arguments:
  dict_file             Excel/CSV data dictionary to scan
options:
  -h, --help            show this help message and exit
  --out OUT_FILE        Destination for the generated JSON map (default:
                        <DICT>-map.json)
  --default-immediate CANON=VALUE
                        Inject default values for missing canonical columns;
                        repeatable
Example:
python map.py Raw.xlsx --out tmp/map.json
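A repeatable CANON=VALUE option such as --default-immediate is commonly built with argparse's append action. The sketch below shows one way that could look; it is an illustration under that assumption, not map.py's actual source. The helper `parse_immediate` is a hypothetical name, and the values `baseline` and `text` are made-up sample defaults.

```python
import argparse


def parse_immediate(text: str) -> tuple[str, str]:
    """Split 'CANON=VALUE' on the first '=' so the value itself may contain '='."""
    canon, sep, value = text.partition("=")
    if not sep or not canon.strip():
        raise argparse.ArgumentTypeError(f"expected CANON=VALUE, got {text!r}")
    return canon.strip(), value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="map.py")
    parser.add_argument("dict_file")
    parser.add_argument("--out", dest="out_file")
    parser.add_argument(
        "--default-immediate",
        action="append",        # each occurrence adds one (canon, value) pair
        default=[],
        type=parse_immediate,
        metavar="CANON=VALUE",
    )
    return parser


args = build_parser().parse_args(
    [
        "Raw.xlsx",
        "--default-immediate", "Form Name=baseline",
        "--default-immediate", "Field Type=text",
    ]
)
defaults = dict(args.default_immediate)  # later duplicates overwrite earlier ones
```

Splitting on the first `=` only is a deliberate choice here, so that a default value containing `=` is not truncated.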
redcap_format.py applies a previously generated map to rebuild the dictionary with canonical columns.
usage: redcap_format.py [-h] --map MAP_FILE --output OUTPUT_FILE
                        [--elide-unlabeled]
                        dict_file
Apply a REDCap mapping JSON
positional arguments:
  dict_file             Original Excel workbook
options:
  -h, --help            show this help message and exit
  --map MAP_FILE        Path to map JSON
  --output OUTPUT_FILE  Destination XLSX to write
  --elide-unlabeled     Also drop rows lacking a Field Label
Example:
python redcap_format.py Raw.xlsx --map tmp/map.json --output tmp/Stage0.xlsx
reformat.py builds a deterministic DSL script (*.ops) equivalent to applying a map.json.
usage: reformat.py [-h] [--map MAP_FILE] [--out OUT_FILE] [--elide-unlabeled]
                   dict_file
positional arguments:
  dict_file          Original Excel/CSV dictionary
options:
  -h, --help         show this help message and exit
  --map MAP_FILE     map.json produced by map.py (defaults to
                     <DICT>-map.json)
  --out OUT_FILE     Path to write the generated DSL (defaults to
                     <DICT>-reformat.rcm)
  --elide-unlabeled  Also delete rows with blank Field Label
Example:
python reformat.py Raw.xlsx --map tmp/map.json --out tmp/structure.ops
rcmod.py executes DSL primitives over one or more sheets and writes the combined output workbook.
usage: rcmod.py [-h] --in INPUT_DICT --out OUTPUT_DICT ops_file
Apply DSL operations to REDCap dictionary
positional arguments:
  ops_file           DSL operations file
options:
  -h, --help         show this help message and exit
  --in INPUT_DICT    Original REDCap dictionary file (XLS/XLSX or CSV)
  --out OUTPUT_DICT  Output corrected dictionary file
Example:
python rcmod.py --in Raw.xlsx --out Stage1.xlsx structure.ops
summarize_rcm.py runs llm_submit.py for each supplied DSL (*.rcm) file and concatenates the
stage summaries into a single Markdown report.
usage: summarize_rcm.py [-h] [--config CONFIG] [--rollup-config ROLLUP_CONFIG]
                        [--output OUTPUT] [--io-dir IO_DIR]
                        [--key-file KEY_FILE]
                        rcm_files [rcm_files ...]
Generate an aggregated summary for multiple RCM files.
positional arguments:
  rcm_files            Paths to *.rcm files in the order they should be
                       summarised
options:
  -h, --help           show this help message and exit
  --config CONFIG      Path to llm_submit job config (default: job_summary.json)
  --rollup-config ROLLUP_CONFIG
                       Path to llm_submit job config for the rollup stage
                       (default: job_summary_rollup.json)
  --output OUTPUT      Path for the combined markdown summary (default:
                       combined-summary.md inside --io-dir when provided)
  --io-dir IO_DIR      Override llm_submit --io-dir (defaults to current working
                       directory)
  --key-file KEY_FILE  Path to OpenAI API key file to pass through to llm_submit
Example:
python summarize_rcm.py stage1.rcm fixes/stage2.rcm \
--output pipeline-summary.md
split_forms.py splits the combined REDCap workbook into one workbook per form based on the
Form Name column. Files are named <basename>-<form>.xlsx and are created only
when more than one form exists.
usage: split_forms.py [-h] [--output-dir OUTPUT_DIR] input
Split REDCap dictionary by form
positional arguments:
  input                 Path to the consolidated REDCap workbook
options:
  -h, --help            show this help message and exit
  --output-dir OUTPUT_DIR
                        Directory to write per-form workbooks (defaults to
                        input directory)
Example:
python split_forms.py data/CTN0095A1/CTN0095A1-reformatted.xlsx
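The splitting rule described for split_forms.py (group by Form Name, name files <basename>-<form>.xlsx, skip when only one form exists) can be sketched with plain dictionaries. This is an illustrative model only: `plan_form_split` is a hypothetical helper, the rows are made-up sample data, and the real script reads and writes workbooks rather than returning a plan.

```python
from collections import defaultdict
from pathlib import Path


def plan_form_split(rows, input_path, output_dir=None):
    """Group dictionary rows by 'Form Name' and map each group to an output path.

    Returns an empty dict when only one form is present, mirroring the rule
    that per-form files are created only when more than one form exists.
    """
    by_form = defaultdict(list)
    for row in rows:
        by_form[row["Form Name"]].append(row)
    if len(by_form) <= 1:
        return {}
    src = Path(input_path)
    # Default to the input file's directory, as the --output-dir help text states.
    out_dir = Path(output_dir) if output_dir else src.parent
    return {out_dir / f"{src.stem}-{form}.xlsx": group for form, group in by_form.items()}


# Made-up sample rows using two REDCap data dictionary columns.
rows = [
    {"Variable / Field Name": "age", "Form Name": "demographics"},
    {"Variable / Field Name": "sex", "Form Name": "demographics"},
    {"Variable / Field Name": "visit_date", "Form Name": "visit"},
]
plan = plan_form_split(rows, "data/Combined.xlsx")
```

Each key in `plan` is the path a per-form workbook would be written to, and each value holds the rows belonging to that form.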
redcap_lint.py validates canonical dictionaries, emitting a JSON lint report and a non-zero exit code when violations occur.
usage: redcap_lint.py [-h] [--report REPORT_FILE] [--form-name FORM_NAME]
                      dict_file
Lint a REDCap data dictionary.
positional arguments:
  dict_file             Path to REDCap data dictionary (CSV/XLS/XLSX)
options:
  -h, --help            show this help message and exit
  --report REPORT_FILE  Write detailed JSON lint report to this path
  --form-name FORM_NAME
                        Override every value in the 'Form Name' column
Example:
python redcap_lint.py Stage1.xlsx --report lint.json || true
fix.py converts augmented lint output into DSL fixes for content (types, choices, validations).
usage: fix.py [-h] --dict DICT_FILE --report REPORT_FILE [-o OUT_FILE]
Compile DSL primitives from augmented report.json
options:
  -h, --help            show this help message and exit
  --dict DICT_FILE      Original REDCap dictionary (.csv or .xlsx)
  --report REPORT_FILE  Augmented report.json with inferred_field_type &
                        configuration
  -o OUT_FILE, --output OUT_FILE
                        Path to write DSL commands (defaults to stdout)
Example:
python fix.py --dict Stage1.xlsx --report lint+.json --output content.ops
infer_submit.py is a legacy direct OpenAI submission helper that reads prompt/reference files and a JSON lint report, then concatenates chunked completions.
usage: infer_submit.py [-h] [--model MODEL] [--max-tokens MAX_TOKENS]
                       [--chunks CHUNKS] [--dry-run]
                       [--log-level {debug,info,warning,error,critical}]
                       [--prompt PROMPT] [--reference REFERENCE]
                       [--report REPORT] [--config CONFIG] [--output OUTPUT]
Submit REDCap inference prompt to the OpenAI API
Example (requires infer_config.json with an api_key field):
python infer_submit.py --report lint.json --output lint+.json
llm_submit.py is the current, configurable OpenAI job runner with auto-chunking, templated outputs, and support for JSON or text sources.
usage: llm_submit.py [-h] --config CONFIG [--source SOURCE] [--model MODEL]
                     [--max-tokens MAX_TOKENS] [--temperature TEMPERATURE]
                     [--job-name JOB_NAME] [--io-dir IO_DIR] [--dry-run]
                     [--key-file KEY_FILE] [--key-env KEY_ENV]
                     [--log-level {debug,info,warning,error,critical}]
                     [--output OUTPUT] [--raw]
General OpenAI submission helper (auto-chunking, io-dir)
Examples:
python llm_submit.py --config job_infer.json --source lint.json --io-dir tmp
python llm_submit.py --config job_summary.json --source stage2.ops --io-dir tmp
- map.json: produced by map.py; see map_file_format.md for schema details.
- job_infer.json and job_summary.json: presets for llm_submit.py describing prompts, models, chunking headers, and output templates.
- infer_prompt.md, summary.md, system_invariants_*.md, and redcap_reference.md: prompt assets loaded by the job configs.
- infer_submit.py reads API keys from --config JSON ({"api_key": "..."}).
- llm_submit.py resolves API keys via --key-file, --key-env, or OPENAI_API_KEY.
- Devcontainer propagates OPENAI_API_KEY and optional OPENAI_BASE_URL from the host; replicate this locally as needed.
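The three key sources listed for llm_submit.py suggest a simple fallback chain. The sketch below illustrates one plausible version; the precedence (key file, then named environment variable, then OPENAI_API_KEY) is an assumption taken from the order in which the options are listed, and `resolve_api_key` is a hypothetical function, not the script's real implementation.

```python
import os
from pathlib import Path


def resolve_api_key(key_file=None, key_env=None, environ=None):
    """Return an API key from a file, a named env var, or OPENAI_API_KEY.

    The precedence used here is assumed, not confirmed by the tool's source.
    """
    env = os.environ if environ is None else environ
    if key_file:
        # Key files usually end with a newline, so strip surrounding whitespace.
        return Path(key_file).expanduser().read_text(encoding="utf-8").strip()
    name = key_env or "OPENAI_API_KEY"
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"no API key found in environment variable {name}")
    return value
```

Passing `environ` explicitly keeps the function testable without touching the real process environment.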
- map.py, redcap_format.py, reformat.py, rcmod.py, split_forms.py: structural mapping tools.
- redcap_lint.py, fix.py: linting and DSL generation for content fixes.
- llm_submit.py, infer_submit.py: LLM submission tooling.
- job_*.json, infer_prompt.md, summary.md, summary_rollup.md, system_invariants_*.md: prompt definitions and job presets.
- redcap_reference.md, map_file_format.md, redcap_convert_dsl.md: reference documentation for REDCap columns and DSL primitives.
- .devcontainer/: Python 3.11 container definition (Debian Bookworm).
- requirements.txt: pandas, openpyxl, openai, tiktoken runtime deps.
- save/: archival copies of earlier scripts (for comparison only).
- New map.py isolates JSON map generation; redcap_format.py now only applies existing maps.
- apply_dsl.py (see save/rfi.py) has been superseded by rcmod.py, which no longer requires --map at runtime.
- compile_fixes.py was renamed to fix.py and now always ensures Section Header exists before applying row fixes.
- infer_submit.py dropped map.json support; the modern replacement is llm_submit.py plus job_infer.json.
- New job workflow: llm_submit.py + job_summary.json produces an author-facing report from the generated DSL.
- redcap_format.py --map exits if --output is omitted; pass an explicit path when normalising.
- rcmod.py only accepts XLS/XLSX inputs for multi-sheet processing; CSV sources must be converted first.
- llm_submit.py aborts if the config embeds api_key, chunks, or source; provide those at runtime instead.
- OpenAI quota errors show a usage summary via quota_utils.summarize_usage when available; check billing limits before retrying.
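To catch the llm_submit.py config rejection before submitting a job, a config can be checked ahead of time. The following is a minimal sketch of such a pre-flight check: `find_forbidden_keys` and `load_job_config` are hypothetical helpers, the sample config keys are invented, and only top-level keys are inspected, which is an assumption about how the real check works.

```python
import json

# Keys the troubleshooting notes say must be supplied at runtime, not in the config.
FORBIDDEN_KEYS = ("api_key", "chunks", "source")


def find_forbidden_keys(config: dict) -> list[str]:
    """Return the runtime-only keys that appear at the top level of a job config."""
    return [key for key in FORBIDDEN_KEYS if key in config]


def load_job_config(text: str) -> dict:
    """Parse a job config and refuse it if it embeds runtime-only settings."""
    config = json.loads(text)
    bad = find_forbidden_keys(config)
    if bad:
        raise ValueError(
            "job config must not embed: " + ", ".join(bad)
            + " (pass these on the command line instead)"
        )
    return config
```

For example, `load_job_config('{"model": "example-model"}')` returns the parsed dict, while a config containing `"api_key"` raises `ValueError`.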
Contributions are welcome via pull request; document any new primitives or job configs alongside code changes. Licensed under the MIT License.