staar: favor grm subcommand for FastSparseGRM #131
Merged
Adds `favor grm --cohort <id> --king-seg <file>`, which builds a sparse ancestry-adjusted GRM + PCA scores from a pre-built cohort store and KING IBD segment output. Mirrors upstream FastSparseGRM (Lin & Dey 2024) end-to-end: KING .seg parsing and degree filtering (R/getUnrels.R, R/calcGRM.R:29-68), greedy unrelated selection with ancestry divergence tie-breaking (R/getUnrels.R:81-125, cppFunct.cpp:calculateDivergence:525-576), randomized PCA via power iteration on carrier-indexed G*v / G'*v operations (R/runPCA.R:drpca:2-78, cppFunct.cpp:postmultiply:252-283 and premultiply:331-366), and block-wise ISAF-adjusted kinship estimation with full two-pass re-estimation for large components (R/calcGRM.R:72-173).

Per-chromosome architecture matches the existing STAAR scoring loop: each chromosome opens one ChromosomeView, walks carriers, accumulates, releases. Peak memory is one chromosome mmap plus accumulators. GRM output is a 3-column TSV loadable directly by the existing kinship::load_kinship path.

New module src/staar/grm/ with king.rs (KING parser + union-find components), unrelated.rs (greedy selection + packed-byte divergence with 256x256 lookup tables), pca.rs (allele freq + postmultiply + premultiply + randomized SVD), estimate.rs (block-wise kinship + two-pass), cache.rs (fingerprint + probe + save under .cohort/cache/grm/), types.rs.

CLI surface: one new subcommand, no changes to favor staar.
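For reviewers unfamiliar with the carrier-indexed G*v / G'*v operations, here is a minimal sketch of the idea. It is not the code in pca.rs: the `Variant` layout and the function signatures are assumptions made for illustration, and the allele-frequency standardization that the upstream routines apply is left out.

```rust
/// One variant's carriers as (sample index, alt-allele dosage).
/// Hypothetical layout; the real cohort store format is not shown in this PR.
pub struct Variant {
    pub carriers: Vec<(usize, u8)>,
}

/// y = G * v, where G is (variants x samples) and only carriers are stored.
/// Non-carriers have dosage 0 and contribute nothing, so they are never visited.
pub fn postmultiply(variants: &[Variant], v: &[f64]) -> Vec<f64> {
    variants
        .iter()
        .map(|var| {
            var.carriers
                .iter()
                .map(|&(sample, dosage)| f64::from(dosage) * v[sample])
                .sum::<f64>()
        })
        .collect()
}

/// x = G' * u, scattering each variant's weight onto its carriers.
pub fn premultiply(variants: &[Variant], u: &[f64], n_samples: usize) -> Vec<f64> {
    let mut out = vec![0.0; n_samples];
    for (var, &weight) in variants.iter().zip(u) {
        for &(sample, dosage) in &var.carriers {
            out[sample] += f64::from(dosage) * weight;
        }
    }
    out
}
```

Both products cost time proportional to the number of stored carriers rather than variants times samples, which is what makes power iteration practical on rare-variant-heavy data.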
Builds a sparse ancestry-adjusted GRM + PCA scores from a pre-built cohort store and KING IBD segment output. Mirrors the full FastSparseGRM pipeline (Lin & Dey, Nature Genetics 2024) end-to-end on the carrier-indexed genotype store.
Per-chromosome architecture: each chromosome opens one ChromosomeView, walks carriers, accumulates, releases. Peak memory is one chromosome mmap plus accumulators. Matches the existing STAAR scoring loop structure so HPC operators can parallelize by chromosome.
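The open / walk / accumulate / release loop can be pictured roughly as below. This is only a sketch: `MockView` is a hypothetical stand-in for ChromosomeView, the `open` callback is an assumed interface, and summing alt-allele dosages per sample is just a placeholder accumulator.

```rust
/// Hypothetical stand-in for the store's per-chromosome view.
/// Each inner Vec holds one variant's carriers as (sample index, dosage).
pub struct MockView {
    pub carriers: Vec<Vec<(usize, u8)>>,
}

/// Walks chromosomes one at a time, keeping only one view alive at once.
/// The accumulator (per-sample alt-allele count) outlives every view.
pub fn accumulate_alt_counts<F>(chroms: &[&str], n_samples: usize, mut open: F) -> Vec<u32>
where
    F: FnMut(&str) -> MockView,
{
    let mut counts = vec![0u32; n_samples];
    for &chrom in chroms {
        let view = open(chrom); // open one chromosome
        for variant in &view.carriers {
            for &(sample, dosage) in variant {
                counts[sample] += u32::from(dosage); // accumulate
            }
        }
        // `view` is dropped here, releasing the chromosome before the next one opens
    }
    counts
}
```

Because each iteration touches only its own view plus the shared accumulator, a per-chromosome job split would need just a final element-wise sum of the accumulators.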
New module src/staar/grm/ with: KING .seg parser + union-find components (king.rs), greedy unrelated selection with packed-byte divergence + 256x256 lookup tables (unrelated.rs), carrier-indexed G*v/G'*v + randomized PCA (pca.rs), block-wise ISAF-adjusted kinship with full two-pass re-estimation (estimate.rs), cache under cohorts//grm/ (cache.rs). GRM output is a 3-column TSV loadable directly by the existing kinship::load_kinship path.
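As a rough illustration of the union-find step, the sketch below groups samples into related components from a list of related pairs. It assumes the pairs have already been parsed from the KING .seg file and filtered by degree; the type and function names are illustrative and are not taken from king.rs.

```rust
use std::collections::BTreeMap;

/// Disjoint-set forest with path halving and union by size.
pub struct UnionFind {
    parent: Vec<usize>,
    size: Vec<usize>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect(), size: vec![1; n] }
    }

    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            // Path halving: point x at its grandparent, then step there.
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    pub fn union(&mut self, a: usize, b: usize) {
        let (mut ra, mut rb) = (self.find(a), self.find(b));
        if ra == rb {
            return;
        }
        // Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb] {
            std::mem::swap(&mut ra, &mut rb);
        }
        self.parent[rb] = ra;
        self.size[ra] += self.size[rb];
    }
}

/// Groups sample indices 0..n into connected components of the relatedness graph.
/// Samples with no related pair come back as singleton components.
pub fn components(n: usize, related_pairs: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(n);
    for &(a, b) in related_pairs {
        uf.union(a, b);
    }
    let mut by_root: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..n {
        by_root.entry(uf.find(i)).or_default().push(i);
    }
    by_root.into_values().collect()
}
```

Components are the natural unit for block-wise work: kinship only needs to be estimated within a component, never across two of them.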
The --grm flag on favor staar wires the sealed GRM artifact into the pipeline. It loads the kinship matrix and injects PCA scores as covariates automatically. Mutually exclusive with --kinship. Rejects PC* columns in --covariates to prevent double-adjustment. Passing --grm with no value infers the path from --cohort.
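The flag rules above could be checked along these lines. This is a hedged sketch, not the PR's argument handling: `check_grm_flags` and `is_pc_column` are hypothetical names, and treating "PC*" as "PC followed only by digits" is an assumption about the matching rule.

```rust
/// Validates the interaction between --grm, --kinship and --covariates.
/// Hypothetical helper; the real CLI wiring is not shown in this PR.
pub fn check_grm_flags(grm: bool, kinship: bool, covariates: &[&str]) -> Result<(), String> {
    if grm && kinship {
        return Err("--grm and --kinship are mutually exclusive".to_string());
    }
    if grm {
        // PCA scores are injected from the GRM artifact, so user-supplied
        // PC columns would adjust for ancestry twice.
        if let Some(pc) = covariates.iter().find(|c| is_pc_column(c)) {
            return Err(format!(
                "covariate '{}' conflicts with PCA scores injected by --grm",
                pc
            ));
        }
    }
    Ok(())
}

/// Assumed rule: a PC column is "PC" followed by one or more ASCII digits.
fn is_pc_column(name: &str) -> bool {
    match name.strip_prefix("PC") {
        Some(rest) => !rest.is_empty() && rest.chars().all(|ch| ch.is_ascii_digit()),
        None => false,
    }
}
```

Under this assumed rule a column such as `PCA_batch` is still accepted; a stricter prefix match would reject it.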
Closes #99