Active Define transforms define.xml from a passive documentation artifact into an executable semantic graph.
Instead of hardcoding SQL or Python filters for every clinical trial (e.g., WHERE VSTESTCD='DIABP'), this engine reads the XML metadata contract and compiles validation logic dynamically.
graph LR
A[define.xml] -->|Compiler| B(Semantic Graph)
C[Raw Data .xpt] -->|Engine| D{Auto-Filter}
B --> D
D --> E[Validated Subsets]
This project follows a production-grade modular architecture:
active-define/
│
├── data/ # Raw inputs (define.xml, vs.xpt)
├── src/ # Core Engine Logic
│ └── active_define/ # The Python Package (Parser, Transpiler)
├── scripts/ # Utility Tools
│ ├── setup.py # Auto-provisions CDISC Pilot 01 test data
│ ├── visualize.py # Generates Mermaid.js logic diagrams
│ └── export.py # Compiles XML to clean JSON
├── notebooks/ # Interactive Demos
├── run_demo.py # CLI Entry Point
└── compiled_metadata.json # Artifact: The compiled logic schema
Install dependencies and download the official CDISC Pilot 01 dataset.
pip install -r requirements.txt
python scripts/setup.pyExecute the semantic graph against the raw Vital Signs data.
python run_demo.pyOutput: The engine will identify 6 unique Vital Sign definitions in the XML and automatically slice the 29,000+ row dataset into validated cohorts without hardcoded filters.
Visualize the logic extracted from the XML.
python scripts/visualize.pyIn traditional Clinical Programming, we manually write code that duplicates the logic in define.xml.
Active Define proves that metadata can be Infrastructure as Code—driving the ingestion process automatically.
