[Issue]: Data model best practices #81

@kcranston

Issue Title

Standardize requirements for GA4GH data models

Issue Type

Product Harmonization

Problem Statement

GA4GH products include many different data models, but GA4GH lacks guidance on how those data models should be represented and which artifacts an approved data model should include.

The purpose of the DaMaSC best practices is to set out potential guidelines for representing GA4GH data models. The intent of these guidelines is to facilitate sharing data models between GA4GH workstreams and to improve the usability of GA4GH data models. They address what should be included in a data model (content), the best forms of representation, and the key ingredients for publishing a data model.

Individual workstreams determine the domain-specific contents of their data models. DaMaSC has been developing recommendations about the types of information and entities that should be present in a data model. These include components such as human- and machine-readable representations, documentation, diagrams, and reference implementations.
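As a rough illustration of how such recommendations could be checked mechanically, the sketch below tests whether a data model lists at least one artifact for each component type named above. The manifest layout, the key names, and the file paths are all hypothetical; they are not taken from the DaMaSC working document.

```python
# Component types named in the paragraph above.
# The key names are hypothetical, chosen only for this sketch.
EXPECTED_ARTIFACTS = (
    "human_readable_representation",
    "machine_readable_representation",
    "documentation",
    "diagrams",
    "reference_implementation",
)


def missing_artifacts(manifest: dict[str, list[str]]) -> list[str]:
    """Return the component types that have no artifacts listed in the manifest."""
    return [name for name in EXPECTED_ARTIFACTS if not manifest.get(name)]


# Hypothetical manifest for one data model; the paths are placeholders.
example_manifest = {
    "human_readable_representation": ["docs/model.md"],
    "machine_readable_representation": ["schema/model.json"],
    "documentation": ["README.md"],
    "diagrams": [],
}

print(missing_artifacts(example_manifest))
```

A check like this only confirms that artifacts are present; it says nothing about their quality or about the domain-specific content, which remains with each workstream.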

Scope Validation

How does this aid harmonization of GA4GH products?
Providing a standard format for data model representation can reduce development time for associated tooling (schema registry, validation, API development).

What barriers to organization-wide harmonization does this address?
It provides a framework for data model development so that groups can focus on domain-specific content rather than representation (e.g. Experiments Metadata).

Which specific alignment challenges does this solve?
It gives adopters of GA4GH data models a harmonized framework for representation and documentation.

Does this require cross-work stream development?
Yes, for any workstream developing products that include data models.

Proposed Solution(s)

DaMaSC has created a working document based on discussions within the group and within GA4GH:

https://docs.google.com/document/d/1HWgp2gu7FpFPSF5bdx_vYbhxNx4f84OKUsA5M-hsdfs/edit?tab=t.0

This document was informed by a survey in 2024, a "Roadshow" series of presentations at GA4GH workstreams in 2025 and a GA4GH Connect session in 2025. Notes from the roadshow presentations:

https://docs.google.com/document/d/1F7CN1SOcJ7qnvGWmjXJVxNH_1qrIUKCQi69SQV2vlsY/edit?tab=t.0

We note that this is the most work-in-progress of the three DaMaSC initiatives (the others being schema registry #79 and ontology recommendations #80).

Estimated Effort Level

Medium (3-6 months, moderate resources)

Success Criteria

If a standard such as this is adopted, all GA4GH-approved data models will have consistent human- and machine-readable representations, use a set of preferred ontologies for terms, and have documentation that allows for easy adoption.

Work Streams Raising This Issue

  • Clinical & Phenotypic Data (Clin/Pheno)
  • Cloud Work Stream
  • Data Security
  • Data Use & Researcher IDs (DURI)
  • Discovery
  • Genomic Knowledge Standards (GKS)
  • Large Scale Genomics (LSG)
  • Regulatory & Ethics (REWS)
  • Data Models & Schemas Committee (DaMaSC)
  • Genomic Implementation Forum (GIF)
  • Technical Team
  • Other (specify below)

Other Groups Raising This Issue

No response

Work Streams That Will Be Impacted

  • Clinical & Phenotypic Data (Clin/Pheno)
  • Cloud Work Stream
  • Data Security
  • Data Use & Researcher IDs (DURI)
  • Discovery
  • Genomic Knowledge Standards (GKS)
  • Large Scale Genomics (LSG)
  • Regulatory & Ethics (REWS)
  • Data Models & Schemas Committee (DaMaSC)
  • Genomic Implementation Forum (GIF)
  • Technical Team
  • Other (specify below)

Other Groups That Will Be Impacted

No response

Key Stakeholders to Consult

No response

Products affected

No response

Additional Context

No response

Priority Level

None

Additional Tags

  • Documentation
  • API
  • Schema
  • Security
  • Performance
  • Interoperability
  • Compliance
  • User Experience
  • Infrastructure
  • Testing
