Block-Based Genotype Embedding Analysis

Unsupervised learning of subject-level genomic representations from LD-aware blocks, with application to asthma-relevant loci and downstream phenotype association.

Phase 1 learns a compact β-VAE embedding per LD block. Phase 2 aggregates block embeddings across the genome via a Transformer with cross-block attention, producing subject-level embeddings and interpretable block-importance weights.
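As a rough illustration of the Phase 1 objective, the sketch below computes a β-VAE loss in NumPy. It assumes a squared-error reconstruction term and a diagonal Gaussian posterior; the function name `beta_vae_loss` and the default `beta=4.0` are hypothetical and are not taken from `scripts/core/VAE_phase1.py`, which may use a different reconstruction loss.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Return (total, reconstruction, kl) for one batch.

    x, x_recon: (n_subjects, n_snps) genotype block and its reconstruction.
    mu, logvar: (n_subjects, latent_dim) posterior mean and log-variance.
    beta: weight on the KL term (hypothetical default, not from the repo).
    """
    # Squared reconstruction error, summed over SNPs, averaged over subjects.
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions.
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return recon + beta * kl, recon, kl
```

Larger values of `beta` push the latent code toward the prior at the cost of reconstruction accuracy, which is the usual trade-off when a compact per-block embedding is the goal.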

Key Idea

Standard genotype analysis treats each SNP independently or applies global LD pruning, losing local genomic context. This project takes a hierarchical approach:

Phase 1 — Per-block β-VAE. Each LD block is encoded independently into a low-dimensional latent vector that captures local haplotype structure.
Phase 2 — Cross-block Transformer. A Transformer aggregates all block embeddings into a single subject-level representation. Learned attention weights identify which blocks are most informative for organizing genetic variation across individuals.
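The aggregation step in Phase 2 can be pictured with a much simpler stand-in: single-query attention pooling over block embeddings. This is a minimal sketch, not the repository's Transformer; `attention_pool` and the `query` vector are hypothetical names, and a real model would learn the query (and use full cross-block self-attention) rather than take it as an argument.

```python
import numpy as np

def attention_pool(block_emb, query):
    """Pool block embeddings into one subject vector.

    block_emb: (n_blocks, d) Phase 1 embeddings for one subject.
    query: (d,) scoring vector (learned in a real model; supplied here).
    Returns (subject_vec, weights), where weights sum to 1 over blocks.
    """
    d = block_emb.shape[1]
    # Scaled dot-product score for each block.
    scores = block_emb @ query / np.sqrt(d)
    # Softmax with max-subtraction for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Weighted sum of block embeddings gives the subject-level vector.
    return weights @ block_emb, weights
```

In this simplified picture, the `weights` vector plays the role of the block-importance scores: averaging it across subjects would give a per-block ranking.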

The result is an embedding that is biologically interpretable (via attention scores), informative within the same ancestry group, and structured for downstream analysis (clustering, phenotype association, leave-one-block-out validation).

Key Results

HLA class II dominates the learned space. Transformer attention weights consistently rank HLA class II blocks highest; these blocks drive the primary axis of subject-level variation.
HLA block embeddings outperform ancestry PCs. HLA block_PC1 explains subject cluster structure beyond what genotype PCs 1–10 can account for, which points to genuine biological signal rather than ancestry confounding.
PDE4D emerges after masking HLA. A leave-HLA-out re-clustering experiment (script 03_leave_hla_out_analysis.py) reveals PDE4D as the next most structurally informative block — consistent with its established role in asthma and β-agonist pharmacogenomics.
Phenotype signal is real but subtle; IgE is the strongest. Continuous phenotypes (blood eosinophil count, IgE, spirometry-based lung function, exacerbation) show the most consistent association with block-level PC features across subjects.
Biology recovered without phenotype labels. The model was trained unsupervised on genotype data only. The emergence of HLA class II and PDE4D in post-hoc analysis validates that the learned geometry reflects known asthma biology.

What Phase 2 Adds Beyond PCA

Conventional PCA on raw genotype data primarily captures population structure — the top components reflect ancestry rather than disease-relevant biology. Phase 1 VAE embeddings preserve local LD-block haplotype structure that SNP-level PCA discards. Phase 2 adds cross-block context: the Transformer learns which blocks co-vary meaningfully across subjects, reorganizing rather than destroying the Phase 1 geometry (Phase 1 vs Phase 2 pairwise-distance correlation ≈ 0.68). The result is a subject-level space where HLA class II dominates the primary axis, PDE4D emerges as the next structurally informative block after HLA removal, and IgE shows stronger phenotype association — biological signal that ancestry-adjusted PCA does not recover.
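The pairwise-distance correlation quoted above can be computed along the following lines. This sketch assumes Euclidean distances between subjects and a Pearson correlation over the unique subject pairs; the source does not state which distance or correlation was used, and `pairwise_distance_correlation` is a hypothetical name.

```python
import numpy as np

def pairwise_distance_correlation(emb_a, emb_b):
    """Pearson correlation between subject-pair distances in two embeddings.

    emb_a, emb_b: (n_subjects, d_a) and (n_subjects, d_b), rows in the same
    subject order. The embedding dimensions may differ.
    """
    def upper_dists(emb):
        # Full Euclidean distance matrix, then keep each unordered pair once.
        diff = emb[:, None, :] - emb[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        return dist[np.triu_indices(emb.shape[0], k=1)]

    return float(np.corrcoef(upper_dists(emb_a), upper_dists(emb_b))[0, 1])
```

A value near 1 would mean Phase 2 leaves subject-to-subject distances essentially unchanged, while a value near 0 would mean the Phase 1 geometry is lost; an intermediate value is consistent with reorganization.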

Figures

1 — Pipeline architecture

Schematic of the two-phase architecture: per-block VAE (Phase 1) feeding into the cross-block Transformer (Phase 2) to produce subject embeddings.

2 — Subject embedding PCA reveals stable genomic structure

PCA of Phase 2 subject embeddings reveals three reproducible strata (k=3; ARI = 0.999). The weak silhouette score (0.139) indicates the learned space is structured as a continuous gradient rather than sharply separated clinical subtypes.
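For readers unfamiliar with the adjusted Rand index (ARI), the following standard-library sketch shows how agreement between two cluster labelings is scored. It assumes the reported ARI compares label assignments from repeated clustering runs, which the text does not spell out; in practice `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same subjects."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two subjects: nothing to disagree on
    # Pairs of subjects that share a cluster in both labelings.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a cluster within each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

Because the index is invariant to how cluster labels are numbered, two runs that find the same three strata score 1.0 even if one run calls a cluster "0" and the other calls it "2".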

Source: scripts/analysis/02_subject_cluster_analysis.py. Filename: docs/images/subject_pca_clusters.png

3 — HLA class II dominance

HLA class II subblocks strongly organize the Phase 2 embedding space. HLA sb15 explains far more cluster variance than ancestry PCs (η² = 0.767 vs 0.051 for genotype PC1) and correlates strongly with the main embedding axis (EmbedPC1 r = −0.88). Source: scripts/analysis/02_subject_cluster_analysis.py.
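The η² values above can be read as one-way ANOVA effect sizes. The sketch below assumes that interpretation, with cluster membership as the grouping factor and a single feature (for example a block PC or a genotype PC) as the response; `eta_squared` is a hypothetical helper, not a function from the analysis scripts.

```python
import numpy as np

def eta_squared(values, groups):
    """Fraction of variance in `values` explained by `groups` (one-way ANOVA).

    values: (n_subjects,) numeric feature.
    groups: (n_subjects,) cluster labels.
    Assumes `values` is not constant, so the total sum of squares is nonzero.
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand_mean) ** 2
        for g in np.unique(groups)
    )
    return float(ss_between / ss_total)
```

Under this reading, a feature whose cluster means are far apart relative to its overall spread scores close to 1, and a feature whose cluster means coincide scores close to 0.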

4 — Three-finding summary

Slide-style summary panel: (1) HLA class II anchors the embedding space, (2) PDE4D is the next signal after HLA removal, (3) phenotype associations are present and biologically coherent without supervised training. Source: conclusions/summary slide. Suggested filename: docs/images/findings_summary.png

Repository Structure

scripts/
  core/       Core pipeline — Phase 1 VAE, Phase 2 Transformer, block analysis, plotting
  analysis/   Numbered post-hoc scripts (01–07): phenotype association, clustering,
              HLA validation, confounder analysis, 17q21 validation
  archive/    Superseded wrappers, exploratory one-offs, debug scripts
configs/      YAML configs for Phase 1, Phase 2, and no-HLA variant
data/         Genotype block files and block manifest (access-restricted, not tracked)
results/      Pipeline outputs (access-restricted, not tracked)
metadata/     Phenotype table, eigenvec file (access-restricted, not tracked)
docs/         Method notes and figures
environment.yml
WORKFLOW.md   Step-by-step execution guide with CLI examples
run_pipeline.sh  Single entry point — runs full pipeline or --dry-run input check
CLAUDE.md     AI assistance constraints and workflow summary

See WORKFLOW.md for full CLI instructions, expected inputs/outputs per step, and execution order.

Quick Start

conda env create -f environment.yml
conda activate genotype-embedding-env

# Run full pipeline (requires restricted data in data/ and metadata/)
./run_pipeline.sh

# Validate inputs only — no training
./run_pipeline.sh --dry-run

# Or run phases individually
python scripts/core/VAE_phase1.py --config configs/config_phase1.yaml
python scripts/core/attention_phase2.py --config configs/config_phase2.yaml

# Post-hoc analysis (example)
python scripts/analysis/03_leave_hla_out_analysis.py

Use --dry-run on run_pipeline.sh or on either phase script to validate inputs without running training. Full details in WORKFLOW.md.

Synthetic Smoke Test

To verify the Phase 1 → Phase 2 pipeline wiring without restricted data access:

./test_run.sh

Expected runtime: under 5 minutes on CPU. The script generates 30 fake subjects across 4 synthetic LD blocks (10 SNPs each, random integers in {0, 1, 2}), runs Phase 1 and Phase 2 with minimal settings (3 epochs), and confirms that required output files are written to results/synthetic_test/ and results/synthetic_test2/.

These outputs validate pipeline wiring only. Synthetic data has no biological meaning and should not be interpreted scientifically.
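The synthetic input described above is easy to reproduce by hand. This is a minimal sketch of that kind of generator, not the code in `test_run.sh`; the function name, the `block_{i}` keys, the seed, and the `int8` dtype are assumptions for illustration.

```python
import numpy as np

def make_synthetic_blocks(n_subjects=30, n_blocks=4, n_snps=10, seed=0):
    """Return {block_name: (n_subjects, n_snps) array} of random genotypes."""
    rng = np.random.default_rng(seed)
    return {
        # Upper bound 3 is exclusive, so values are drawn from {0, 1, 2}.
        f"block_{i}": rng.integers(0, 3, size=(n_subjects, n_snps), dtype=np.int8)
        for i in range(n_blocks)
    }
```

Each array could then be written with `np.save` to mimic the per-block `.npy` layout, for example `np.save("block_0.npy", blocks["block_0"])`.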

Data

Raw genotype data and phenotype tables are not version-controlled (access-restricted). The repository preserves analysis logic, configuration, derived summaries, and documentation sufficient for rerunning with appropriate input access.

Core inputs: per-block .npy genotype matrices, block manifest TSV, subject phenotype CSV, ancestry eigenvec file.
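Before training, it can help to sanity-check the block matrices. The sketch below is a hypothetical pre-flight check, not a description of what `--dry-run` actually does. It assumes each matrix is laid out as subjects × SNPs and coded as 0/1/2 dosages, as in the synthetic test; real data with missing-value codes would need a looser rule.

```python
import numpy as np

def check_block_matrices(blocks):
    """Return a list of problems found; an empty list means all checks passed.

    blocks: {block_name: ndarray}, e.g. built with np.load on each .npy file.
    """
    problems = []
    n_subjects = None
    for name, mat in blocks.items():
        if mat.ndim != 2:
            problems.append(f"{name}: expected 2-D array, got {mat.ndim}-D")
            continue
        # Every block should cover the same set of subjects (rows).
        if n_subjects is None:
            n_subjects = mat.shape[0]
        elif mat.shape[0] != n_subjects:
            problems.append(f"{name}: {mat.shape[0]} subjects, expected {n_subjects}")
        # Assumed coding: allele dosage in {0, 1, 2}.
        if not np.isin(mat, (0, 1, 2)).all():
            problems.append(f"{name}: values outside {{0, 1, 2}}")
    return problems
```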

Environment

Python 3.10+. Key dependencies: torch, numpy, pandas, scikit-learn, matplotlib, umap-learn, hdbscan, statsmodels, scipy, seaborn, pyyaml (imported as yaml).

conda env create -f environment.yml
