DGN is a framework integrating scRNA-seq and GWAS data to explore disease-associated genes, cell types, and gene network modules from a network perspective.
DGN requires Python >= 3.6 and Java SE Runtime Environment (JRE) >= 1.8.
# download
git clone https://github.com/pmglab/DGN
# install dependent packages
cd DGN
python -m pip install -r requirements.txt
# test
python dgn.py --versionFor single-cell RNA sequencing (scRNA-seq), the expression matrix must be preprocessed by annotating cell types and then splitting the matrix according to these cell types.
Each cell type should have a corresponding expression matrix file, where each row represents a gene and each column represents a cell.
The values should be tab-separated. The file should be named in the format {CellTypeName}.txt.gz. All expression matrix files should be placed in a single directory,
referred to as EXPR_DIR in the following demonstrations.
The human and mouse reference single-cell datasets used in our DGN study can be downloaded from https://doi.org/10.6084/m9.figshare.26771971.
The GWAS summary statistics should include three columns: chromosome number, position, and p-value. Each row represents a variant.
The GWAS summary statistics file is referred to as GWAS_FILE for demonstration purposes in the following sections.
Reference genotype data that matches your GWAS population and coordinate version can be downloaded from https://doi.org/10.6084/m9.figshare.26771755.
For example, the hg19 version of the European population genotype file is named EUR.hg19.vcf.gz and will be referred to as REF_VCF in the following demonstrations.
Assume your output directory is set to OUTPUT_DIR and the abbreviation for the GWAS phenotype is PHENO_ABBR. The command to run is as follows:
python dgn.py \
--expression_dir EXPR_DIR \
--gwas_path GWAS_FILE \
--vcf_ref REF_VCF \
--output_dir OUTPUT_DIR \
--phenotype_abbr PHENO_ABBRMake sure to replace these placeholder variables with the actual values relevant to your specific situation.
Output Directory Structure:
- The
intermediatefolder stores temporary result files generated during the network construction process. - The
node_scorefolder stores gene connectivity profile files (and, if--run_expr_deseis set, gene expression profile files as well). - The
resultfolder stores the formal DGN results, which include three aspects: disease-associated genes, cell types, and gene network modules. Detailed descriptions are provided below.
The disease-associated gene results are stored in {PHENO_ABBR}.gene.assoc.condi.txt. The meanings of the column names in the table are as follows:
| Header | Description |
|---|---|
| Gene | Gene symbol |
| Chrom | Chromosome of the gene |
| StartPos | Start coordinate of the gene |
| EndPos | End coordinate of the gene |
| #Var | Number of variants within the gene |
| Group | LD group number. Conditional ECS tests were performed for genes within the same LD group. |
| ECSP | p-value of ECS |
| CondiECSP | p-value of the conditional gene-based association tests by conditional ECS |
| GeneScore | Gene’s selective-expression score. A gene with a high score will be given higher priority to enter the conditioning procedure. |
The disease-associated cell types results are stored in {PHENO_ABBR}.celltype.txt. The meanings of the column names in the table are as follows:
| Header | Description |
|---|---|
| TissueName | Name of the tissue being tested |
| Unadjusted(p) | Unadjusted p-values for the tissue-phenotype associations |
| Adjusted(p) | Adjusted p-values calculated by adjusting both selection bias and multiple testing |
| Median(IQR)SigVsAll | Median (interquartile range) expression of the conditionally significant genes and all the background genes |
(1) The complete results of disease-associated gene modules are stored in {PHENO_ABBR}.assoc_gene_module.txt.
The functional enrichment annotation results for significantly associated modules (FDR < 0.1) are stored in
{PHENO_ABBR}.assoc_gene_module_anno.xlsx. The meanings of the column names in the table are as follows:
| Header | Description |
|---|---|
| module_id | The module ID |
| cell_type | The cell type in which the module is located |
| p | The module association p-value |
| p.adj_fdr | The p-value adjusted using the Benjamini-Hochberg FDR method |
| assoc_gene_number | The number of disease-associated genes within the module |
| module_gene_number | The total number of genes within the module |
| assoc_gene | The disease-associated genes in the module |
| module_gene | All the genes in the module |
| "GO:BP," "GO:CC," and "GO:MF" | The GO enrichment results for Biological Process, Cellular Component, and Molecular Function, respectively. |
| KEGG | The KEGG pathway enrichment results. |
(2) The visualization results of disease-significantly associated modules are stored in the {PHENO_ABBR}.plot folder.
The file {module_id}.module_network.htmlcontains the network diagram of the module, while{module_id}.module_heatmap.pngshows the heatmap of the network edge weights.
For example:
![]() |
![]() |
|---|---|
{module_id}.module_network.html |
{module_id}.module_heatmap.png |
| Flag | Description | Default |
|---|---|---|
--expression_dir |
Input a directory that contains gene expression profile of different cell types. Each file represents a specific cell type, with filenames following the convention "{CellTypeName}.txt.gz". |
| Flag | Description | Default |
|---|---|---|
--gwas_path |
GWAS summary table with header. | |
--chrom_col |
Chromosome column in GWAS summary table. | CHR |
--pos_col |
Base position column in GWAS summary table. | BP |
--p_col |
P-value column in GWAS summary table. | P |
--buildver |
Specifies the reference genome version of the coordinates. | hg19 |
| Flag | Description | Default |
|---|---|---|
--output_dir |
Output directory. | |
--phenotype_abbr |
Phenotype abbreviation, used as the file identifier for this phenotype in the output results. |
| Flag | Description | Default |
|---|---|---|
--trans_gene_symbol |
Convert the gene ID in the expression profiles to gene symbol of HGNC. | |
--min_expr_value |
The minimum average gene expression level used for constructing co-expression networks. | 0 |
--min_cell |
The minimum number of cells or samples required in the analysis. | 100 |
--max_cell |
The max number of cells or samples required in the analysis. | 1000 |
--min_k |
The minimum value of k in normalization for co-expression network. | 0.5 |
--max_k |
The max value of k in normalization for co-expression network. | 1.5 |
--fold_IQR |
How many times the IQR is the normal range of the average |r| |
1.5 |
--edge_method |
The method for calculating edge weights (i.e., gene correlations),pearson or cs-core. |
cs-core |
--resample_size |
Sample size for resampling edge weights when correcting the co-expression network. | 100000 |
--keep_expr |
No need to recalculate the gene centrality in the gene co-expression network for the next analysis. |
| Flag | Description | Default |
|---|---|---|
--vcf_ref |
Specifies a VCF file of genotypes sampled from a reference population. | |
--multiple_testing |
Specifies the method for multiple testing correction. bonf denotes performing Bonferroni correction; benfdr denotes controlling false discovery rate by the Benjamini–Hochberg method; fixed denotes no correction. |
benfdr |
--p_value_cutoff |
Specifies the threshold of the adjusted p-value for fine-mapping. | 0.05 |
--top_n_gene |
Maximum number of genes entering conditional association analysis. | 1000 |
--nt |
Specifies the number of threads. | 1 |
--rm_hla |
Remove HLA region. | |
--java_path |
Java path. | java |
--jvm_gb |
JVM value (GB). | 20 |
| Flag | Description | Default |
|---|---|---|
--assoc_cell_p_cutoff |
The adjusted p cutoff for associating cell types. The significant cell types will be used for associated network module analysis. | 0.05 |
--module_gene_score_n_top_genes |
In the module detection function, specify the number of top genes with high disease-related scores for module detection | 5000 |
--module_cut_edge_weight |
Minimum edge weight for pruning nodes in modules. | 0.5 |
--module_plot_cut_edge_weight |
Minimum edge weight for pruning nodes in plotting modules. | 0.6 |
--show_node_label_top_n |
Display the labels of the top n genes by connectivity. | 5 |
--function_enrich_p_cutoff |
P-value threshold for functional enrichment analysis by g:Profiler. | 0.05 |
--function_enrich_top_n |
Maximum number of functional enrichment terms displayed. | 3 |
| Flag | Description | Default |
|---|---|---|
--run_expr_dese |
Calculate the mean expression profile of genes and run DESE. | |
--normalize_expr |
The normalization method for the expression profiles. no denote skip normalization, cpm denotes count per million (CPM) normalization. |
cpm |
--log_trans_expr |
Transform the expression value into log2(x+1). |

