This data model is in active development. It builds on HTAN Phase 1 and incorporates input from the Cancer Data Standards (CDS) initiative. Expect frequent changes until a stable version is released.
This repository is part of ongoing efforts to refine and standardize the HTAN2 data model.
Documentation: Full documentation is available at https://htan2-data-model.readthedocs.io/en/main/
The HTAN2 data model is built using LinkML, a schema modeling language whose generators produce Python data model classes and JSON schemas. The model follows a modular architecture with a clear separation of concerns.
The diagram above illustrates the separation between Record-Based Modules (Clinical, Biospecimen) and File-Based Modules (WES, Digital Pathology, etc.), with the Core File Module providing universal attributes for all file-based modules.
- Purpose: Universal attributes shared across all file-based modules
- Location: modules/CoreFile/domains/core.yaml
- Key Features:
  - Single primary key definition (HTAN_DATA_FILE_ID)
  - Required field definitions for relationships
  - HTAN identifier validation patterns
  - Base class for inheritance (CoreFileAttributes)
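The base-class idea can be sketched in plain Python. This is only an illustration of the inheritance pattern, not the generated code: the subclass name, its extra field, the placement of HTAN_PARENT_ID in the base class, and the ID values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoreFileAttributes:
    """Illustrative stand-in for the Core File base class."""

    HTAN_DATA_FILE_ID: str  # primary key, defined once in the core module
    HTAN_PARENT_ID: str     # relationship field (assumed to live in the core module)


@dataclass
class ExampleAssayLevel1(CoreFileAttributes):
    """Hypothetical file-based module class that inherits the core attributes."""

    EXAMPLE_PLATFORM: Optional[str] = None  # hypothetical assay-specific field


# Placeholder identifiers, not real HTAN IDs.
record = ExampleAssayLevel1(
    HTAN_DATA_FILE_ID="EXAMPLE_D0001",
    HTAN_PARENT_ID="EXAMPLE_B0001",
)
```

Because the key is declared only on the base class, every file-based module built this way carries the same identifier fields without redefining them.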
- Purpose: Clinical and demographic data
- Location: modules/Clinical/
- Structure: Multiple domain files (demographics, diagnosis, therapy, etc.)
- Features: Comprehensive validation rules and conditional requirements
- Purpose: Comprehensive biospecimen metadata and classification
- Location: modules/Biospecimen/
- Structure: 18 domain-specific enum files with medical classifications
- Features: RFC-compliant implementation with 39 core attributes, ICD-10/ICD-O-3 integration, UBERON tissue ontology
- Purpose: Base sequencing attributes shared across all sequencing types
- Location: modules/Sequencing/
- Structure: BaseSequencingAttributes class with common sequencing metadata
- Features: Library layout enums, sequencing platform enums, workflow metadata
- Purpose: Bulk Whole Exome Sequencing data
- Location: modules/WES/
- Structure: Three processing levels (Level 1, 2, 3)
- Features: Sequencing platform enums, quality metrics, variant calling
- Purpose: Single-cell RNA sequencing data
- Location: modules/scRNA-seq/
- Structure: Three data levels (Level 1, 2, 3/4) with h5ad format validation
- Features: Single-cell isolation methods, workflow types, AnnData schema compliance
- Purpose: Base imaging attributes shared across all imaging modules
- Location: modules/Imaging/
- Structure: BaseImagingAttributes class with common imaging metadata
- Features: De-identification methods, imaging equipment, microscopy parameters, quality control
- Purpose: Whole-slide imaging (WSI) data from H&E and other tissue-based assays
- Location: modules/DigitalPathology/
- Structure: Single data level (Level 2) with Bio-Formats/OpenSlide compatible formats
- Features: Annotation support, slide label handling, CRDC alignment, format validation
- Purpose: Multiplexed tissue imaging assays (CODEX, CyCIF, IMC, MIBI, etc.)
- Location: modules/MultiplexMicroscopy/
- Structure: Three data levels (Level 2: imaging + channel metadata, Level 3: segmentation masks, Level 4: cell-by-feature tables)
- Features: Channel metadata, image dimensions, multiplex assay types, CRDC alignment
- Purpose: Sequencing-based and sequence-hybridization spatial omics assays (Visium, Xenium, CosMx, STOmics, etc.)
- Location: modules/SpatialOmics/
- Structure: Four components (Level 1: raw data bundle, optional; Level 3: processed bundle, required; Level 4: interoperable file, optional; Panel: panel information)
- Features: Platform flexibility, bundle-level metadata, panel information, QC metrics, conditional requirements
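The optional/required split in the Structure bullet above can be expressed as a small check. This is a sketch only: the function and constant names are hypothetical, and treating Panel as optional is an assumption, since the list above does not state its status.

```python
# Requirement status per component, as read from the Structure bullet above.
# "Panel" has no stated status there; it is treated as optional here (assumption).
SPATIAL_OMICS_COMPONENTS = {
    "Level 1": False,  # raw data bundle, optional
    "Level 3": True,   # processed bundle, required
    "Level 4": False,  # interoperable file, optional
    "Panel": False,    # panel information
}


def check_submission(components: set[str]) -> list[str]:
    """Return a list of problems for a set of submitted component names."""
    problems = []
    # Report names that are not part of the module at all.
    for name in sorted(components - set(SPATIAL_OMICS_COMPONENTS)):
        problems.append(f"unknown component: {name}")
    # Report required components that were not submitted.
    for name, required in SPATIAL_OMICS_COMPONENTS.items():
        if required and name not in components:
            problems.append(f"missing required component: {name}")
    return problems
```

An empty list means the submission includes the required processed bundle and names only known components.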
htan2-data-model/
├── modules/                  # All data model modules
│   ├── CoreFile/             # Universal file attributes
│   ├── Clinical/             # Clinical data domains
│   ├── Biospecimen/          # Biospecimen metadata and classification
│   ├── Sequencing/           # Base sequencing attributes
│   ├── Imaging/              # Base imaging attributes
│   ├── WES/                  # Whole Exome Sequencing
│   ├── scRNA-seq/            # Single-cell RNA sequencing
│   ├── DigitalPathology/     # Digital Pathology imaging
│   ├── MultiplexMicroscopy/  # Multiplex Microscopy imaging
│   └── SpatialOmics/         # Spatial Omics assays
├── config/                   # LinkML configuration
├── scripts/                  # Utility scripts
├── tests/                    # Root-level tests
└── docs/                     # Documentation
Participant (HTAN_PARTICIPANT_ID)
└── Biospecimen (HTAN_BIOSPECIMEN_ID)
    ├── Level 1 Data (HTAN_DATA_FILE_ID; HTAN_PARENT_ID: _B####)
    ├── Level 2 Data (HTAN_DATA_FILE_ID; HTAN_PARENT_ID: _D####)
    └── Level 3 Data (HTAN_DATA_FILE_ID; HTAN_PARENT_ID: _D####)
- HTAN_DATA_FILE_ID: Unique identifier for data files across all levels
- HTAN_PARTICIPANT_ID: HTAN ID associated with a patient
- HTAN_BIOSPECIMEN_ID: HTAN Biospecimen ID of the parent biospecimen
- HTAN_PARENT_ID: References the parent entity using a suffix convention
  - _B####: References a biospecimen
  - _D####: References a data file
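The suffix convention for HTAN_PARENT_ID lends itself to a short helper. The sketch below assumes that #### stands for one or more digits at the end of the ID; the actual validation patterns live in the Core File module and may be stricter. The function name and the example IDs are hypothetical.

```python
import re

# Assumed reading of the convention: "_B" plus digits marks a biospecimen parent,
# "_D" plus digits marks a data file parent.
_PARENT_SUFFIX = re.compile(r"_([BD])\d+$")


def parent_kind(htan_parent_id: str) -> str:
    """Classify an HTAN_PARENT_ID as 'biospecimen' or 'data_file' by its suffix."""
    match = _PARENT_SUFFIX.search(htan_parent_id)
    if match is None:
        raise ValueError(f"no _B#### or _D#### suffix in {htan_parent_id!r}")
    return "biospecimen" if match.group(1) == "B" else "data_file"
```

Under these assumptions, a Level 1 file would resolve to a biospecimen parent, while Level 2 and Level 3 files would resolve to a data file parent.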
- Python 3.11+
- Poetry (dependency management)
- LinkML tools
# Clone the repository
git clone <repository-url>
cd htan2-data-model
# Install dependencies
poetry install
# Generate schemas
make modules-gen
# Run tests
make test

Each module contains detailed documentation:
- Core File Module: See modules/CoreFile/README.md for primary/foreign key definitions
- Clinical Module: See modules/Clinical/README.md for domain descriptions
- Biospecimen Module: See modules/Biospecimen/README.md for RFC compliance and enum schemas
- Sequencing Module: See modules/Sequencing/README.md for base sequencing attributes
- Imaging Module: Base imaging attributes (no separate README; see DigitalPathology and MultiplexMicroscopy)
- WES Module: See modules/WES/README.md for sequencing levels
- scRNA-seq Module: See modules/scRNA-seq/README.md for single-cell RNA sequencing levels
- Digital Pathology Module: See modules/DigitalPathology/README.md for digital pathology imaging
- Multiplex Microscopy Module: See modules/MultiplexMicroscopy/README.md for multiplex microscopy imaging
- Spatial Omics Module: See modules/SpatialOmics/README.md for spatial omics assays
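To see which modules ship their own README in a local checkout, a short script along these lines may help. It is a hypothetical helper, not part of the repository's scripts, and it assumes only the modules/ layout described above.

```python
from pathlib import Path


def modules_without_readme(repo_root: str) -> list[str]:
    """Return names of directories under modules/ that lack a README.md."""
    modules_dir = Path(repo_root) / "modules"
    return sorted(
        entry.name
        for entry in modules_dir.iterdir()
        if entry.is_dir() and not (entry / "README.md").is_file()
    )
```

Calling modules_without_readme(".") from the repository root lists the module directories that have no README.md of their own.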
For detailed contribution guidelines, development conventions, and step-by-step instructions for adding new modules, see CONTRIBUTING.md.
- Fork the repository
- Create feature branch: git checkout -b feat/[module-name]
- Follow conventions in CONTRIBUTING.md
- Run tests: make test
- Submit pull request
- LinkML Documentation: https://linkml.io/
- HTAN Phase 1 (archived): https://github.com/ncihtan/data-models
- Cancer Data Standards: https://cancer.gov/cancer-data-standards
