unified dataset layout: one root directory for variants, genotypes, annotations #95

@vineetver

Description

Ingest currently produces two disconnected sibling directories: / for variants and .genotypes/ for genotypes. Annotate and enrich add more siblings, so the user has to thread paths between commands by hand. The .cohort/ store tracks cohorts but not the ingest and annotate stages that feed them.

Proposed: everything lives under one root (either .cohort/ or a named dataset dir). Each command discovers what it needs from the store. No manual -o paths for the common case.

.cohort/
  datasets/<name>/
    manifest.json
    variants/chromosome={chr}/...
    genotypes/
      samples.txt
      chromosome={chr}/...
    annotations/chromosome={chr}/...
  cohorts/<id>/
    manifest.json
    sparse_g.bin
    variants.parquet
    membership.parquet
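As a rough illustration of how commands could derive every path from the root, here is a minimal sketch. The helper names (`dataset_dir`, `partition_dir`, `cohort_dir`) are hypothetical and not part of the existing codebase; only the directory names come from the layout above.

```python
from pathlib import Path

# Stage directories taken from the proposed layout; everything else is illustrative.
STAGES = ("variants", "genotypes", "annotations")


def dataset_dir(root: Path, name: str) -> Path:
    """Directory holding all stages of one ingested dataset."""
    return root / "datasets" / name


def partition_dir(root: Path, name: str, stage: str, chrom: str) -> Path:
    """Per-chromosome partition of a stage, e.g. variants/chromosome=22."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return dataset_dir(root, name) / stage / f"chromosome={chrom}"


def cohort_dir(root: Path, cohort_id: str) -> Path:
    """Directory holding one cohort's manifest and matrices."""
    return root / "cohorts" / cohort_id
```

With helpers like these, a command only needs the root and a dataset name; no sibling paths have to be passed between steps.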

Ingest writes to datasets/. Annotate finds unannotated datasets and adds annotations/. Staar builds cohorts from annotated datasets. The user only specifies the input VCF and trait file. Everything else is resolved from the store.
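The discovery step for annotate could look roughly like the sketch below. It assumes a dataset counts as "unannotated" when it has a variants/ directory but no annotations/ directory; the real check might instead read manifest.json, whose schema is not specified in this issue. The function name `find_unannotated` is hypothetical.

```python
from pathlib import Path


def find_unannotated(root: Path) -> list[str]:
    """Return names of datasets that have variants/ but no annotations/.

    Assumption: presence of the annotations/ directory marks a dataset
    as annotated. A manifest-based check may be more robust in practice.
    """
    datasets = root / "datasets"
    if not datasets.is_dir():
        return []
    return sorted(
        d.name
        for d in datasets.iterdir()
        if (d / "variants").is_dir() and not (d / "annotations").is_dir()
    )
```

A directory-presence check is the simplest option, but it cannot distinguish a finished annotation run from one that was interrupted partway; recording stage status in the manifest would cover that case.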

Related: #64, #59, #62, #27, #87


Labels: enhancement (New feature or request), ingest (VCF/genotype ingest pipeline)
