[Issue]: Schema Registry #79

@nsheff

Issue Title

GA4GH Schema Registry

Issue Type

Product Harmonization

Problem Statement

GA4GH workstreams independently develop domain-specific standards, each defining their own vocabulary and data representation schemas. These schemas exist in scattered locations with no centralized discovery mechanism, forcing researchers, pipeline developers, and data custodians to search across multiple repositories or duplicate work already completed by others. There is no standardized way to represent, version, or share schemas across the GA4GH ecosystem, making it difficult to track schema evolution, signal deprecation, or understand relationships between schemas. When organizations develop new schemas for emerging experimental methods, they often start from scratch rather than building on existing community work.
The GA4GH Schema Registry would provide a centralized, standardized resource where the community can discover, share, and govern schemas used across GA4GH products and biomedical research. Pipeline developers could publish schemas defining their input requirements, research consortia could find existing schemas for new experimental methods, and data custodians could discover GA4GH-compliant schemas to ensure interoperability with Data Connect and other products. The registry would support schema versioning, deprecation signals, relationships between schemas, and multiple format representations to accommodate diverse user communities.
Without a schema registry, fundamental interoperability barriers exist across GA4GH products. When Beacon, Phenopackets, Data Connect, and other specifications each define schemas independently, it becomes difficult to bridge these products or create federated systems leveraging multiple standards. Driver projects must manually reconcile different representations of common concepts like "subject," "sample," or units of measurement. This fragmentation slows adoption of GA4GH products and hinders organizations trying to understand which schemas to implement. By establishing a standardized schema registry specification, GA4GH can facilitate schema discovery and reuse, enable clearer versioning and governance, and provide the foundation for improved alignment across the entire ecosystem.
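The capabilities listed above (versioning, deprecation signals, relationships between schemas, and multiple format representations) could be modeled roughly as follows. This is only an illustrative sketch, not the proposed specification: the class names `SchemaVersion` and `SchemaRecord`, every field name, the `example-workstream` namespace, and the `extends` relation are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SchemaVersion:
    """One published version of a schema (hypothetical structure)."""
    version: str                          # e.g. "1.0.0"
    formats: Dict[str, str]               # format name -> serialized schema text
    deprecated: bool = False              # deprecation signal for this version
    superseded_by: Optional[str] = None   # version that replaces this one, if any


@dataclass
class SchemaRecord:
    """A schema with its version history and links to related schemas."""
    namespace: str                        # hypothetical owner, e.g. a workstream
    name: str
    versions: List[SchemaVersion] = field(default_factory=list)
    # relation type -> identifiers of related schemas (hypothetical convention)
    related: Dict[str, List[str]] = field(default_factory=dict)

    def latest_active(self) -> Optional[SchemaVersion]:
        """Return the most recently added version that is not deprecated."""
        for candidate in reversed(self.versions):
            if not candidate.deprecated:
                return candidate
        return None


# Invented example data; it does not describe any real GA4GH schema.
record = SchemaRecord(
    namespace="example-workstream",
    name="sample",
    versions=[
        SchemaVersion("1.0.0", {"json-schema": '{"type": "object"}'},
                      deprecated=True, superseded_by="1.1.0"),
        SchemaVersion("1.1.0", {"json-schema": '{"type": "object"}'}),
    ],
    related={"extends": ["example-workstream/subject"]},
)
```

Keeping the deprecation flag on each version, rather than on the schema as a whole, is one possible design choice: it lets a client ask for the current recommended version while older versions stay retrievable.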

Scope Validation

✅ Harmonization Impact:
The Schema Registry enables harmonization by providing a centralized location where workstreams can discover and adopt common schemas for shared concepts like "subject," "sample," and units of measurement, rather than creating incompatible alternatives.

✅ Barrier Reduction:
The registry eliminates the discoverability barrier that causes workstreams to operate in isolation and duplicate schema development, while also addressing governance barriers through standardized versioning and deprecation mechanisms.

✅ Alignment Challenges:
This helps to avoid vocabulary and data representation misalignment across workstreams by making it visible when different products define the same concepts differently, enabling targeted harmonization efforts.

✅ Cross-Work Stream:
Yes, the Schema Registry requires cross-work stream development as it must accommodate schemas from Beacon, Phenopackets, Data Connect, GKS, and other workstreams, serving as shared infrastructure for the entire GA4GH ecosystem.

Proposed Solution(s)

Approach Options:

We recommend developing a GA4GH Schema Registry API Specification with a proof-of-concept implementation to demonstrate feasibility and gather community feedback, followed by a driver project to create a working GA4GH Schema Registry instance. Alternative approaches considered include using existing solutions like SchemaBlocks, FAIRsharing, or EMBL-EBI OLS, but these either lack GA4GH-specific governance, have insufficient adoption within the community, or focus on ontologies and controlled vocabularies rather than providing infrastructure for sharing data structure schemas.

Implementation:

This requires the existing focus group to continue developing the specification and proof-of-concept. Resources needed include technical expertise for API specification development, POC backend infrastructure, ongoing engagement with workstreams (particularly Beacon, Phenopackets, GKS, and Data Connect) to validate requirements, and a driver project to establish the production GA4GH Schema Registry instance where workstreams can publish their schemas. Existing work in progress includes Ian Fore's exploratory implementation at https://github.com/ianfore/ga4gh-starter-schema-repository with a running instance on Google Cloud Run, Jupyter notebooks demonstrating the API, the schema registry prototype at PEPhub, and Jonathan Fuerth's client.

Estimated Effort Level

Unknown (needs further assessment)

Success Criteria

Measurable Outcomes:

Deliverables include a completed GA4GH Schema Registry API Specification, a functional proof-of-concept implementation demonstrating core operations, and minimal governance recommendations.

For the second step (the driver project), the deliverable is a working GA4GH Schema Registry instance that serves as a standard location for GA4GH users to share schemas.

Key Metrics:
Progress indicators include completion of the specification document, successful POC demonstration of list/read/access operations with versioning and metadata support, and validation by at least three GA4GH workstreams. Completion is achieved when the specification is finalized, the POC successfully handles schemas from multiple workstreams (e.g., Beacon, Phenopackets, GKS), and documentation enables independent implementations.
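To make the list/read/access operations with versioning and metadata support more concrete, here is a minimal in-memory sketch. It is not the proposed API and is not taken from any of the existing prototypes: the class `InMemoryRegistry`, its method names, and the mapping of "list", "read", and "access" onto those methods are all assumptions for illustration.

```python
from typing import Dict, List, Optional


class InMemoryRegistry:
    """Toy registry keyed by namespace, name, and version (hypothetical)."""

    def __init__(self) -> None:
        # Layout: {namespace: {name: {version: {"metadata": dict, "content": str}}}}
        self._store: Dict[str, Dict[str, Dict[str, dict]]] = {}

    def publish(self, namespace: str, name: str, version: str,
                content: str, metadata: Optional[dict] = None) -> None:
        """Add a new version; existing versions are never overwritten."""
        versions = self._store.setdefault(namespace, {}).setdefault(name, {})
        if version in versions:
            raise ValueError(f"{namespace}/{name}@{version} already exists")
        versions[version] = {"metadata": dict(metadata or {}), "content": content}

    def list_schemas(self, namespace: str) -> List[str]:
        """List: schema names in a namespace, sorted alphabetically."""
        return sorted(self._store.get(namespace, {}))

    def list_versions(self, namespace: str, name: str) -> List[str]:
        """List: versions of one schema, in publication order."""
        return list(self._store[namespace][name])

    def read_metadata(self, namespace: str, name: str, version: str) -> dict:
        """Read: metadata attached to one schema version."""
        return self._store[namespace][name][version]["metadata"]

    def get_content(self, namespace: str, name: str, version: str) -> str:
        """Access: the serialized schema document itself."""
        return self._store[namespace][name][version]["content"]


# Invented example data; "example-ws" and the "maturity" key are hypothetical.
registry = InMemoryRegistry()
registry.publish("example-ws", "sample", "1.0.0",
                 '{"type": "object"}', {"maturity": "draft"})
registry.publish("example-ws", "sample", "1.1.0",
                 '{"type": "object", "required": ["id"]}')
```

Treating published versions as immutable is an assumption of this sketch; it keeps references to a specific version stable, but the actual specification may choose differently.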

How will this issue aid GA4GH harmonization?

How does this aid harmonization of GA4GH products?
The Schema Registry enables harmonization by providing centralized infrastructure where workstreams can discover and adopt common schemas for shared concepts like "subject," "sample," and units of measurement, rather than creating incompatible alternatives independently.

What barriers to organization-wide harmonization does this address?
The registry eliminates the discoverability barrier that causes workstreams to operate in isolation and duplicate schema development, while also addressing governance barriers through standardized versioning and deprecation mechanisms that make it clear which schemas are current and recommended.

Which specific alignment challenges does this solve?
This solves vocabulary and data representation misalignment across workstreams by making it visible when different products define the same concepts differently, enabling targeted harmonization efforts and facilitating the pilot interoperability framework bridging Experiments Metadata Standard, Beacon, Phenopackets, and GKS.

Does this require cross-work stream development?
Yes, the Schema Registry requires cross-work stream development as it must accommodate schemas from Beacon, Phenopackets, Data Connect, GKS, and other workstreams, serving as shared infrastructure for the entire GA4GH ecosystem.

Additional context

Please provide any additional information you feel is relevant to this issue.

Work Streams Raising This Issue

  • Clinical & Phenotypic Data (Clin/Pheno)
  • Cloud Work Stream
  • Data Security
  • Data Use & Researcher IDs (DURI)
  • Discovery
  • Genomic Knowledge Standards (GKS)
  • Large Scale Genomics (LSG)
  • Regulatory & Ethics (REWS)
  • Data Models & Schemas Committee (DaMaSC)
  • Genomic Implementation Forum (GIF)
  • Technical Team
  • Other (specify below)

Other Groups Raising This Issue

No response

Work Streams That Will Be Impacted

  • Clinical & Phenotypic Data (Clin/Pheno)
  • Cloud Work Stream
  • Data Security
  • Data Use & Researcher IDs (DURI)
  • Discovery
  • Genomic Knowledge Standards (GKS)
  • Large Scale Genomics (LSG)
  • Regulatory & Ethics (REWS)
  • Data Models & Schemas Committee (DaMaSC)
  • Genomic Implementation Forum (GIF)
  • Technical Team
  • Other (specify below)

Other Groups That Will Be Impacted

No response

Key Stakeholders to Consult

No response

Products affected

Any products that share or use schemas could be affected.

Additional Context

No response

Priority Level

None

Additional Tags

  • Documentation
  • API
  • Schema
  • Security
  • Performance
  • Interoperability
  • Compliance
  • User Experience
  • Infrastructure
  • Testing
