unified dataset layout: one root directory for variants, genotypes, annotations

Ingest currently produces two disconnected sibling directories: `<output>/` for variants and `<output>.genotypes/` for genotypes. Annotate and enrich add more siblings, so the user has to thread paths between commands by hand. The `.cohort/` store knows about cohorts but not about the ingest and annotate stages that feed them.

Proposed: everything lives under one root (either .cohort/ or a named dataset dir). Each command discovers what it needs from the store. No manual -o paths for the common case.

```
.cohort/
  datasets/<name>/
    manifest.json
    variants/chromosome={chr}/...
    genotypes/samples.txt, chromosome={chr}/...
    annotations/chromosome={chr}/...
  cohorts/<id>/
    manifest.json
    sparse_g.bin, variants.parquet, membership.parquet
```
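To make the layout concrete, here is a rough sketch of how a command could derive every per-chromosome path from just the root, a dataset name, and a chromosome. The function name `dataset_paths` and the dictionary keys are hypothetical and only for illustration; the directory names come from the tree above.

```python
from pathlib import Path


def dataset_paths(root: Path, name: str, chromosome: str) -> dict[str, Path]:
    """Resolve the proposed on-disk locations for one dataset and chromosome.

    Hypothetical helper: only the directory names are taken from the
    proposed layout, everything else is an assumption for illustration.
    """
    base = root / "datasets" / name
    partition = f"chromosome={chromosome}"  # hive-style partition directory
    return {
        "manifest": base / "manifest.json",
        "variants": base / "variants" / partition,
        "samples": base / "genotypes" / "samples.txt",
        "genotypes": base / "genotypes" / partition,
        "annotations": base / "annotations" / partition,
    }
```

With something like this, no command needs an explicit output path in the common case: the root plus a dataset name is enough to locate variants, genotypes, and annotations.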

Ingest writes to datasets/. Annotate finds unannotated datasets and adds annotations/. Staar builds cohorts from annotated datasets. The user only specifies the input VCF and trait file. Everything else is resolved from the store.
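A minimal sketch of the discovery step for annotate, assuming a dataset counts as "unannotated" when it has a `variants/` directory but no `annotations/` directory. That rule and the function name `unannotated_datasets` are assumptions; the real check might read `manifest.json` instead.

```python
from pathlib import Path


def unannotated_datasets(root: Path) -> list[str]:
    """List dataset names under root/datasets/ that still need annotation.

    Assumption: presence of an annotations/ directory means "annotated".
    A manifest-based check may be more robust against partial writes.
    """
    datasets_dir = root / "datasets"
    if not datasets_dir.is_dir():
        return []  # empty or uninitialised store
    return sorted(
        d.name
        for d in datasets_dir.iterdir()
        if d.is_dir()
        and (d / "variants").is_dir()
        and not (d / "annotations").is_dir()
    )
```

Checking directories keeps the sketch short, but it cannot tell a finished annotation run from an interrupted one, which is one argument for recording stage completion in the manifest.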

Related: #64, #59, #62, #27, #87

unified dataset layout: one root directory for variants, genotypes, annotations #95

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

unified dataset layout: one root directory for variants, genotypes, annotations #95

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions