[Issue]: Data model best practices #81
Description
Issue Title
Standardize requirements for GA4GH data models
Issue Type
Product Harmonization
Problem Statement
GA4GH products include many different data models, but GA4GH lacks guidance on how those data models should be represented and which artifacts an approved data model should include.
The purpose of the DaMaSC best practices is to discuss potential guidelines for representing GA4GH data models. The intent of these guidelines is to facilitate sharing data models between GA4GH workstreams and to improve usability of GA4GH data models. They address what should be included in a data model (content), best forms of representation, and the key ingredients for publication of a data model.
Individual workstreams determine the domain-specific contents of their data models. DaMaSC has been developing recommendations about the types of information and entities that should be present in a data model. These include human- and machine-readable representations, documentation components, diagrams, and reference implementations.
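To make the component list above concrete, here is a minimal sketch of how a data model package could be checked for completeness. It is only an illustration under stated assumptions: the category identifiers, the `DataModelPackage` class, the `missing_artifacts` helper, and the example file paths are all hypothetical and are not taken from the DaMaSC working document.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Artifact categories taken from the paragraph above. The identifiers are
# illustrative only, not an agreed DaMaSC vocabulary.
REQUIRED_ARTIFACTS = (
    "human_readable",
    "machine_readable",
    "documentation",
    "diagram",
    "reference_implementation",
)


@dataclass
class DataModelPackage:
    """Hypothetical description of what a workstream publishes for one model."""

    name: str
    # Maps an artifact category to a path or URL (assumed layout).
    artifacts: Dict[str, str] = field(default_factory=dict)


def missing_artifacts(package: DataModelPackage) -> List[str]:
    """Return required categories with no non-empty entry, in checklist order."""
    return [
        category
        for category in REQUIRED_ARTIFACTS
        if not package.artifacts.get(category, "").strip()
    ]


# Hypothetical package that has no diagram or reference implementation yet.
example = DataModelPackage(
    name="example-model",
    artifacts={
        "human_readable": "docs/model.md",
        "machine_readable": "schema/model.json",
        "documentation": "README.md",
    },
)
print(missing_artifacts(example))  # ['diagram', 'reference_implementation']
```

A check of this kind could run in continuous integration for a data model repository, but which artifacts are mandatory and how they are named would be for the guidelines to decide.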
Scope Validation
How does this aid harmonization of GA4GH products?
Providing a standard format for data model representation can reduce development time for associated tooling (schema registry, validation, API development).
What barriers to organization-wide harmonization does this address?
It provides a framework for data model development so that groups can focus on domain-specific content rather than on representation (e.g. Experiments Metadata).
Which specific alignment challenges does this solve?
It gives adopters of GA4GH data models a harmonized framework for representation and documentation.
Does this require cross-work stream development?
Yes, for any workstream developing products that include data models.
Proposed Solution(s)
DaMaSC has created a working document based on discussions within the group and within GA4GH:
https://docs.google.com/document/d/1HWgp2gu7FpFPSF5bdx_vYbhxNx4f84OKUsA5M-hsdfs/edit?tab=t.0
This document was informed by a survey in 2024, a "Roadshow" series of presentations at GA4GH workstreams in 2025 and a GA4GH Connect session in 2025. Notes from the roadshow presentations:
https://docs.google.com/document/d/1F7CN1SOcJ7qnvGWmjXJVxNH_1qrIUKCQi69SQV2vlsY/edit?tab=t.0
We note that this is the most work-in-progress of the three DaMaSC initiatives (the others being the schema registry #79 and the ontology recommendations #80).
Estimated Effort Level
Medium (3-6 months, moderate resources)
Success Criteria
If a standard such as this is adopted, all GA4GH-approved data models will have consistent human- and machine-readable representations, use a set of preferred ontologies for terms, and have documentation that allows for easy adoption.
Work Streams Raising This Issue
- Clinical & Phenotypic Data (Clin/Pheno)
- Cloud Work Stream
- Data Security
- Data Use & Researcher IDs (DURI)
- Discovery
- Genomic Knowledge Standards (GKS)
- Large Scale Genomics (LSG)
- Regulatory & Ethics (REWS)
- Data Models & Schemas Committee (DaMaSC)
- Genomic Implementation Forum (GIF)
- Technical Team
- Other (specify below)
Other Groups Raising This Issue
No response
Work Streams That Will Be Impacted
- Clinical & Phenotypic Data (Clin/Pheno)
- Cloud Work Stream
- Data Security
- Data Use & Researcher IDs (DURI)
- Discovery
- Genomic Knowledge Standards (GKS)
- Large Scale Genomics (LSG)
- Regulatory & Ethics (REWS)
- Data Models & Schemas Committee (DaMaSC)
- Genomic Implementation Forum (GIF)
- Technical Team
- Other (specify below)
Other Groups That Will Be Impacted
No response
Key Stakeholders to Consult
No response
Products affected
No response
Additional Context
No response
Priority Level
None
Additional Tags
- Documentation
- API
- Schema
- Security
- Performance
- Interoperability
- Compliance
- User Experience
- Infrastructure
- Testing