Computational biologist and machine learning scientist working at the intersection of genomics, proteomics, single-cell biology, and reproducible computational pipelines.
I have a Ph.D. in Computer Science and currently work in computational biology, building ML methods and scalable workflows for disease heterogeneity, patient stratification, and treatment response.
- Genomics + representation learning
- Single-cell and multi-omics analysis
- UK Biobank and large-scale proteomics
- Reproducible bioinformatics pipelines
If you only look at a few projects, these are the best entry points:
- Genotype representation learning – LD-aware VAE + transformer framework for discovering latent genomic structure and asthma-related biology.
- Single-cell benchmarking – reproducible single-cell workflows and perturbation-aware cell-state scoring.
- Large-scale proteomics – Olink NPX and UK Biobank pipelines for Long COVID and disease subtype discovery.
- End-to-end reproducible genomics pipelines – Nextflow, Docker, and configuration-driven workflows from raw sequencing data to analysis.
LD-aware representation learning for genotype data using VAEs and transformers.
- Recovering asthma-related genomic structure without phenotype labels
- Identified strong HLA class II and PDE4D signals from latent embeddings
- Demonstrated improved latent organization versus baseline approaches
Key result: recovered known asthma biology with embedding structure achieving ARI = 0.999 and revealing multiple independent HLA haplotype axes.
Benchmarking perturbation-aware cell-state scoring methods in single-cell RNA-seq data.
- Reproducible Scanpy workflow from preprocessing through scoring and visualization
- Compared multiple scoring approaches and control gene sets
- Includes interferon-stimulated PBMC analysis and communication scoring
Key result: built an interpretable benchmark where interferon-response programs achieved near-perfect separation while preserving biologically meaningful cell-state differences.
Scalable Olink NPX proteomics analysis for pediatric Long COVID and UK Biobank replication.
- Disease subgroup discovery using WHO-defined symptom profiles
- Spark + SQL workflows for tens of thousands of participants
- Regression, enrichment, volcano plots, and replication analyses
Key result: developed reproducible pipelines to compare neurocognitive and non-neurocognitive Long COVID subtypes across pediatric and UK Biobank cohorts.
Educational but production-style genomics workflow from FASTQ through variant calling, GWAS, PRS, and biological interpretation.
- Covers QC, alignment, variant calling, annotation, PCA, GWAS, PRS, and eQTL analysis
- Uses a centralized Conda environment and documented module-level workflows
- Excludes large/generated outputs from version control while preserving reproducible commands
Key result: provides a clean, reproducible template for moving from raw sequencing data to interpretable disease-associated variants.
Analysis of miRNA signatures associated with inhaled corticosteroid response in asthma.
- Differential expression, enrichment, and subgroup analyses
- Focus on miR-584-5p and neurocognitive phenotype differences
- Includes publication-ready plots and reproducible analysis scripts
Key result: identified candidate miRNA signatures associated with treatment response heterogeneity.
Python · PyTorch · Representation learning · VAEs · Transformers · SHAP · NLP
Genomics · GWAS/PLINK · Polygenic Risk Scores · Single-cell RNA-seq · Multi-omics · UK Biobank · Scanpy · Seurat · Olink NPX proteomics · NGS pipelines · WNN integration
Nextflow · Docker · Conda · SLURM · GitHub Actions · Spark SQL · UK Biobank RAP
NumPy · Pandas · scikit-learn · Matplotlib · Seaborn · SQL · PySpark
I enjoy building computational biology projects that are both scientifically meaningful and technically reproducible: methods that can move from raw data to interpretable biological insight, with clear workflows, versioned environments, and reusable code.

