Skip to content

bayesomicslab/BSEEJ

Repository files navigation

Bayesian Reconstruction and Differential Testing of Excised Introns

Authors:
Marjan Hosseini
Devin J. McConnell
Derek Aguiar

Graphical abstract




The code for the BSEEJ method is provided in bseej.py.
A compressed folder of junction files for one example gene (A2ML1) is included in this repository.
To run BSEEJ with default configurations, first install dependencies (e.g., from requirements.txt or environment.yml), then run bseej.py. A more detailed guide follows.


Installation

  1. Clone the repository:
git clone https://github.com/bayesomicslab/BSEEJ.git
cd BSEEJ
  1. Install the project dependencies and activate the environment using one of these options:

Option 1 (recommended): Conda

conda env create -f environment.yml
conda activate bseej_env

Option 2: venv + pip (uses pinned versions from requirements.txt)

python -m venv .venv
# Linux / macOS
source .venv/bin/activate


pip install --upgrade pip
pip install -r requirements.txt
  1. Unzip the example data:
unzip A2ML1.zip

How to Run

Quick start (uses default parameters):

python3 bseej.py

Example with explicit paths/params (documents intent and works on fresh setups):

python3 bseej.py -g A2ML1 -p ./A2ML1 -o ./results -k 1 -i 1000 -e 0.01 -a 1 -r 1 -s 1

Note: The junction path must contain .junc files (e.g., the unzipped A2ML1 folder).
The output directory will be created if it does not exist.


Usage

python3 bseej.py [options]

Options (defaults shown by python3 bseej.py --help):

  • -g, --gene <GENE_NAME>: Gene symbol to train (e.g., A2ML1; default: A2ML1)
  • -p, --junction_path <DIR>: Directory containing .junc files (default: ./A2ML1)
  • -o, --out_dir <DIR>: Output directory (created if missing; default: ./results)
  • -k, --clusters <INT>: Number of clusters (default: 1)
  • -i, --max_iter <INT>: Maximum iterations (default: 1000)
  • -e, --eta <FLOAT>: Model parameter η (default: 0.01)
  • -a, --alpha <FLOAT>: Model parameter α (default: 1.0)
  • -r, --r <FLOAT>: Model parameter r (default: 1.0)
  • -s, --s <FLOAT>: Model parameter s (default: 1.0)
  • -h, --help: Show help and exit

An example notebook (example.ipynb) demonstrates usage.

Prepare the input:

For creating the EGA and simulated data BAM files we ran the STAR [1] aligner for fast and accurate alignment.

STAR --runThreadN 20 --genomeDir ../genome_data/genome_index/ --outFileNamePrefix ./person_${i}_ --twopassMode  Basic --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate  --readFilesIn ${1}/person_${i}_1.fa ${1}/person_${i}_2.fa

In this example $${1}$ is the directory where the files are located. The input file 'person_*_1.fa' is a collection of genes for the ith sample on forward sequence and the second is the collection of genes on the backward sequence. This allows us to use the twopassMode and we also used the intronMotif in order to obtain spliced alignments (XS). Once the files have been aligned they are then separated out into the individual bam files of just one gene to work on at a time.

Regtools [2] was used for an efficient filtering of the junctions.

regtools junctions extract -s 0 -a 6 -m 50 -M 500000 %s -o %s.junc  

Here the two %s are the bam file and output name respectively. On all data used in this project, EGA, Geuvadis, and simulations, we used the following flags:

* -s: finds XS/unstranded flags
* -a: minimum anchor length into exon (6 bp)
* -m: minimum intron size (50 bp)
* -M: Maximum intron size (500000 bp)
portcullis prep -t 20 -v --force -o %s_portcullis/1-prep/ GRCh38.primary_assembly.genome.fa %s/%s.bam

First step of portcullis [3] is prep and it takes in the fasta file from the reference genome used, here is an example from the data simulations. It is important to note that portcullis was run on our simulations and both experimental results. Here %s is there name of the folder to direct output to and the bam file name.

portcullis junc -t 20 -v -o %s_portcullis/2-junc/portcullis_all --intron_gff %s_portcullis/1-prep/    

The next step that is being done is extraction of the junctions into a gff format. Here %s is just giving it the name of the folder to look into.

portcullis filt -t 20 -v -n --max_length 500000 --min_cov 30 -o %s_portcullis/3-filt/portcullis_filtered --intron_gff %s_portcullis/1-prep/ %s_portcullis/2-junc/portcullis_all.junctions.tab

Finally, we set some filtering using portcullis. We only keep introns that have a max length of 500000, do not use the machine learning filtering, and have a minimum coverage of 30. Here again %s is just point to the gene folder to look into for the files. After portcullis is complete we do an overlap check between the junctions found from regtools and portcullis and only keep the ones that overlap with at least 90%.

Model:

BREM Probabilistic Graphical Model



Main variables and parameters include:

  • $V$ is the set of unique intron excisions, indexed by $v$ and its size of denoted by $|V|$.

  • $N$ is the number of samples and are indexed by $i$.

  • $J_i$ is the number of intron excisions in ith sample.

  • $K$ is the number of clusters (indexed by $k$).

  • For the jth intron excision in the ith sample, we assign a cluster k.

  • Graph $G = (V, E)$, where V is the set of unique intron excision and there is an edge between two intron excisions iff they intersect each other.

  • $\Omega$ is the set of all the independent sets in $G$.

  • $r, s$ are priors for $\pi$ (Beta distribution):

$$ \pi_k \sim Beta(r, s), \forall k = {1, \dots, K} $$



* Increase in mean of Beta$(r,s)$ results in increase in cluster size |SEEJ|.

  • The structure of a (clusters) SEEJs consists of the inclusion or exclusion of intron excisions.


$$ b_{kv} \sim Bernoulli(\pi_k), \ \forall v \in C \\ \text{s.t.} \ b_k \in \Omega $$

  • For cluster k, $\beta_{k}$ is a $|V|$-dimensional Dirichlet which represents the distribution of the cluster k over the intron excisions.

$$ \beta_k \sim Dirichlet_{|V|}(\eta \odot b_k) $$



* $\eta = (\eta_1, \eta_2, ..., \eta_{|V|})$ is $\beta$ variable prior.

  • For the ith sample, variable $\theta_i$ is a $K$-dimensional Dirichlet distribution and represents the proportions of the clusters in sample $i$.

$$ \theta_i \sim \text{Dirichlet}_K(\alpha) $$



* $\alpha = (\alpha_1, \alpha_2, ..., \alpha_K)$ is $\theta$ variable prior.

  • Variable $z_{ij}$ is the cluster assignment for jth intron excision in ith sample. It can take a natural value between 1 and $K$ and follows a Multinomial:

$$ z_{ij} \sim \text{Multinomial}(\theta_i) \\ z \in {1, \ldots, K}^{N \times J_i} $$

  • In the ith sample, the jth intron excision is $w_{ij}$ and is observed and follows a Multinomial distribution:

$$ w_{ij} \sim \text{Multinomial}(\beta_{z_{ij}}) $$

References

[1] Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.

[2] Feng, Y. Y., Ramu, A., Cotto, K. C., Skidmore, Z. L., Kunisaki, J., Conrad, D. F., ... & Griffith, M. (2018). RegTools: Integrated analysis of genomic and transcriptomic data for discovery of splicing variants in cancer. BioRxiv, 436634.

[3] Mapleson, D., Venturini, L., Kaithakottil, G., & Swarbreck, D. (2018). Efficient and accurate detection of splice junctions from RNA-seq with Portcullis. GigaScience, 7(12), giy131.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •