Code to reproduce the analyses in “Population-scale Characterization of the Oral Microbiome and Associations with Metabolic Health”.
- Population-scale, high-resolution metagenomics with deep metabolic phenotyping: We profile standardized bilateral buccal-swab whole-metagenome data in 9,431 HPP adults, paired with 44 metabolic measures spanning liver ultrasound, CGM, and DXA.
- A unified, rigorous multi-layer MWAS framework: We systematically test associations across strain, gene-family, and pathway layers using covariate-adjusted regression and layer-wise multiple-testing control, enabling direct comparison of signals across metabolic systems.
- Actionable outputs with translational and external support: We deliver a multi-system oral–metabolic association atlas with prioritized cross-phenotype markers, demonstrate proof-of-concept metabolic disease classification using phenotype-selected oral features, and provide independent directional replication at genus resolution.
- Controlled Access: Due to ethical and IRB requirements, HPP data is available through a controlled-access portal.
- Access Portal: https://humanphenotypeproject.org/data-access
- Process: Researchers must submit a statement of purpose and sign a data use agreement. Upon approval, data can be accessed in a secure environment.
- TRE Tutorial: After obtaining access, please refer to User-guide-for-TRE.pdf for a detailed guide on how to use the Trusted Research Environment (TRE).
- Ethics Approval: Weizmann Institute IRB #1719-1.
conda create -n oral_hpp python==3.11
conda activate oral_hpp
pip install -r requirements.txt
The analysis follows a sequential workflow where inputs and outputs are chained. While the high-level steps are outlined below, please refer to:
- Subdirectory READMEs: Each folder contains a local README.md with detailed execution instructions and script-level documentation.
- Preprocess (preprocess/): Clean phenotypes; standardize strain/pathway/gene-family abundance (zero-replacement → normalization → PPM → log₁₀).
- Association analysis (association_analyse/): OLS regression adjusted for age, sex, and smoking, then Bonferroni correction (correct_P_value_*).
- Key oral features (Identification_key_oral_features/): Rank features by association breadth and take the top features per system.
- Oral feature grouping (oral_features_classfication/): Classify significant strains/pathways as Favourable / Adverse / Mixed from association directions across the liver, CGM, and body systems.
- Metabolic disease classification (metabolic_diseases/): Select pathways linked to disease-related phenotypes (5-fold CV); train classifiers (LightGBM) on strain/pathway abundance; evaluate with cross-validation.
- Replication (replication_study/): In an independent cohort (NHANES), preprocess genus and phenotype data, run the association analysis (BMI, waist circumference), then compare effect directions.
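
The abundance transform named in the preprocessing step (zero-replacement → normalization → PPM → log₁₀) can be illustrated with a minimal, standard-library sketch. This is not the repository's implementation: the function name `transform_sample`, the `zero_fraction` parameter, and the choice of replacing zeros with half of the smallest non-zero value in the sample are all assumptions made for illustration; see preprocess/ for the actual rule.

```python
import math


def transform_sample(abundances, zero_fraction=0.5):
    """Apply zero-replacement, normalization, PPM scaling and log10 to one sample.

    `abundances` maps feature name -> non-negative raw abundance.
    The zero-replacement rule (zero_fraction * smallest non-zero value)
    is an assumed placeholder, not necessarily what the pipeline uses.
    """
    positives = [v for v in abundances.values() if v > 0]
    if not positives:
        raise ValueError("sample has no non-zero abundances")

    # 1) Zero-replacement: substitute a small pseudo-abundance for zeros.
    pseudo = min(positives) * zero_fraction
    filled = {k: (v if v > 0 else pseudo) for k, v in abundances.items()}

    # 2) Normalization: rescale so the sample sums to 1.
    total = sum(filled.values())

    # 3) PPM scaling and 4) log10 transform.
    return {k: math.log10(v / total * 1e6) for k, v in filled.items()}


# Hypothetical toy sample with one undetected feature.
toy_sample = {"feature_a": 8.0, "feature_b": 2.0, "feature_c": 0.0}
print(transform_sample(toy_sample))
```

Replacing zeros before normalizing, as the step order above suggests, keeps every feature strictly positive so that the final log₁₀ is always defined.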

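
As a rough illustration of how the Bonferroni step and the Favourable / Adverse / Mixed grouping fit together, the sketch below adjusts p-values and then labels the significant features by direction. Everything here is hypothetical: the function names, the toy data, the 0.05 threshold, the convention that each association sign has already been oriented so that -1 points toward better metabolic status and +1 toward worse, and the rule that any disagreement in sign yields "Mixed". The scripts in association_analyse/ and oral_features_classfication/ define the actual behaviour.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    n_tests = len(p_values)
    return {name: min(p * n_tests, 1.0) for name, p in p_values.items()}


def group_by_direction(directions):
    """Label a feature from its oriented association signs.

    Assumed convention: -1 = associated with better metabolic status,
    +1 = associated with worse metabolic status.
    """
    signs = set(directions)
    if not signs or not signs <= {-1, 1}:
        raise ValueError("directions must be a non-empty sequence of -1/+1")
    if signs == {-1}:
        return "Favourable"
    if signs == {1}:
        return "Adverse"
    return "Mixed"


# Hypothetical toy input: feature -> (raw p-value, oriented signs across systems)
toy = {
    "strain_A": (0.001, [-1, -1, -1]),
    "strain_B": (0.004, [1, 1]),
    "pathway_C": (0.010, [-1, 1]),
    "pathway_D": (0.400, [1]),
}

adjusted = bonferroni_adjust({name: p for name, (p, _) in toy.items()})
groups = {
    name: group_by_direction(toy[name][1])
    for name, p_adj in adjusted.items()
    if p_adj < 0.05  # assumed significance threshold
}
print(groups)
```

The toy example treats all four tests as a single family; with layer-wise control, the same adjustment would be applied separately within the strain, gene-family, and pathway layers.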