kkyamada/laser

LASER: LLM/ML-assisted seBE missense sgRNA design

LASER is a pipeline for designing sgRNAs targeting missense mutations introduced by split-engineered base editors (seBEs), assisted by large language models (LLMs) and machine learning (ML) models. The pipeline is still being refined to improve performance and reduce computation time; updates will be released in the future. See the LICENSE before use, as some models this repository relies on have specific licensing agreements.

Summary

1. Dependencies and installation

Our LASER pipeline combines multiple machine-learning models, some of whose Python and/or package dependencies conflict with each other. Therefore, we install some models in independent virtual environments using pyenv and poetry/pipenv and call each environment as needed. General requirements are as follows.

  • CUDA: 11.3
  • pyenv: 2.3.36
  • samtools: 2.10

1.1. Virtual environment for the LASER pipeline

The LASER pipeline was developed with python 3.8.12. It is highly recommended to use the same python version to avoid incompatibility of dependencies. Install the required python version as follows using pyenv.

pyenv install 3.8.12
pyenv local 3.8.12

Build the virtual environment as follows. The required packages are listed in the pyproject.toml file located in the main directory of this repository and can be installed with poetry.

python3 -m pip install poetry
python3 -m poetry install
python3 -m poetry run pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

Set up Pangolin in the same virtual environment as follows.

cd ../
git clone https://github.com/tkzeng/Pangolin.git
cd laser
python3 -m poetry shell
poetry run pip install ../Pangolin
poetry run pip install setuptools==57.5.0
poetry run pip install pyvcf==0.6.8

Clone the necessary external repositories:

cd ../
git clone https://github.com/OATML-Markslab/EVE.git
git clone https://github.com/OATML-Markslab/Tranception.git
git clone https://github.com/maxwshen/be_predict_efficiency.git
git clone https://github.com/maxwshen/be_predict_bystander.git

Download the pre-computed Tranception model checkpoint.

cd ../Tranception/
mkdir model
cd model
curl -o Tranception_Large_checkpoint.zip https://marks.hms.harvard.edu/tranception/Tranception_Large_checkpoint.zip
unzip Tranception_Large_checkpoint.zip
rm Tranception_Large_checkpoint.zip

Make sure that the project directories are on the same level as follows.

  • laser
  • EVE
  • Tranception
    • model
      • Tranception_Large
  • Pangolin
  • be_predict_efficiency
  • be_predict_bystander
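
The layout above can be checked before running the pipeline. The following is a minimal sketch, not part of the LASER repository: the function name check_laser_layout is hypothetical, and it assumes the parent directory that holds all repositories is passed as the first argument.

```shell
# Hypothetical helper (not part of the LASER repository): report which of the
# expected sibling directories are missing under a given parent directory.
check_laser_layout() {
    local parent="${1:-..}"   # parent directory holding all repositories
    local missing=0
    local dir
    for dir in laser EVE Tranception/model/Tranception_Large Pangolin \
               be_predict_efficiency be_predict_bystander; do
        if [ ! -d "${parent}/${dir}" ]; then
            echo "missing: ${parent}/${dir}" >&2
            missing=1
        fi
    done
    # Non-zero status if at least one directory is absent.
    return "${missing}"
}
```

From inside the laser directory, running check_laser_layout .. lists any missing directory on stderr and returns a non-zero status if the layout is incomplete.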

1.2. Virtual environment for BE-HIVE models

BE-HIVE models are required for the LASER pipeline, but their environment has some version conflicts with that of LASER. Install the required python version for BE-HIVE as follows using pyenv.

pyenv install 3.6.3

Build the virtual environment as follows. The poetry requirement files are provided under the poetry_files directory of this repository.

cp laser/poetry_files/* ./be_predict_efficiency/
cd be_predict_efficiency
pyenv local 3.6.3
python3 -m pip install poetry
python3 -m poetry install

2. Usage

2.1. General guideline

This repository supports sgRNA library design for split-engineered base editors and also post-hoc analysis of those sgRNA libraries to provide the predicted effect score (LASER score) for each sgRNA.

2.2. Curating genomic information

This step collects genomic information, such as primary sequences and exon positions, for the genes of interest. Set the path where the primary sequence files and annotation files will be saved, and set the target species (only mouse and human are supported in this project). It is also recommended to add these settings to your bash/zsh profile.

export DB=PATH_TO_DATABASE
export LASER_SPECIES=mouse
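
Because the later scripts depend on these two variables, it can help to validate them first. The sketch below is an illustration only: check_laser_env is a hypothetical name, the lowercase value human is assumed by analogy with mouse, and creating the DB directory in advance is an assumption rather than something the repository requires.

```shell
# Hypothetical helper (not part of the LASER repository): validate the
# environment variables described above before running any pipeline script.
check_laser_env() {
    if [ -z "${DB:-}" ]; then
        echo "DB is not set" >&2
        return 1
    fi
    case "${LASER_SPECIES:-}" in
        mouse|human) ;;   # assumed spellings; only these two species are supported
        *)
            echo "LASER_SPECIES must be 'mouse' or 'human'" >&2
            return 1
            ;;
    esac
    # Assumption: creating the database directory up front is harmless.
    mkdir -p "${DB}"
}
```

Calling check_laser_env after the two export lines returns a non-zero status when either variable is missing or the species is unsupported.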

Run the following script download_database.sh to download primary assembly sequences and annotations of human and mouse genomes.

cd scripts
source download_database.sh

The above script will generate the following files under the directory you specified as DB.

  • Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz: Human primary assembly from Ensembl
  • gencode.v46.annotation.gtf.gz: Human genome annotations from GENCODE
  • gencode.v46.ensembl.annotation.txt: Curated human genome annotations for Ensembl Canonical Transcripts.
  • Mus_musculus.GRCm39.dna.primary_assembly.fa.gz: Mouse primary assembly from Ensembl
  • gencode.vM36.annotation.gtf.gz: Mouse genome annotations from GENCODE
  • gencode.vM36.ensembl.annotation.txt: Curated mouse genome annotations for Ensembl Canonical Transcripts.

Run the following script feature0_preprocess.sh to

    1. Curate genomic information of each gene of interest listed in gene_list_example.txt (one gene per line).
    2. Enumerate all possible sgRNA (NGG PAM) spacer sequences overlapping the coding region of each gene of interest listed in gene_list_example.txt.
    3. Curate sgRNA sequences and information for the downstream analysis.

You may customize the output directory path (OUTPUTDIR) and base editor type (BETYPE) as needed. The other scripts in the LASER pipeline assume the default paths and names, so if you change them here, change them in those scripts as well.

source feature0_preprocess.sh

This step will generate the following directory and files for each gene listed in gene_list_example.txt. The following example is shown for the case of mouse Myb gene with the base editor type set to evoA1.

  • data_mouse
    • Myb
      • ExonsSpreadsheet-Mus_musculus_Transcript_Exons_ENSMUST00000188495.8.csv
      • Mus_musculus_ENSMUST00000188495.8_withutr_100.fa
      • Myb_tiling_evoA1.csv
  • data_mouse_processed
    • Myb
      • Myb.fasta
      • Myb_tiling_evoA1.csv

data_mouse_processed/Myb/Myb.fasta contains the protein sequence of the target gene, and data_mouse_processed/Myb/Myb_tiling_evoA1.csv contains the list of sgRNA sequences along with curated genomic information.

2.3. Predicting base editing outcomes

This step is designed to predict base editing outcomes using BE-HIVE for the sgRNA sequences curated in the prior step. Because BE-HIVE predictions take 5-10 minutes per gene, the script feature1_run_BEHIVE.sh is designed to be performed once per gene.

Run the following script feature1_run_BEHIVE.sh to

    1. Predict the base editing efficiency for each sgRNA
    2. Predict the distribution of edited outcomes for each sgRNA

By default, this script assumes that both BE-HIVE git repositories are cloned in the home directory and that the base editor is evoA1. Otherwise, these settings must be specified manually within the script.

source feature1_run_BEHIVE.sh Myb
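
Since the script takes one gene at a time, a small loop over the gene list can run it for every gene in sequence. This is a minimal sketch under stated assumptions: run_per_gene is a hypothetical helper, the list has one gene symbol per line as in gene_list_example.txt, and skipping blank lines and lines starting with # is a convenience added here, not a documented feature of the pipeline.

```shell
# Hypothetical helper (not part of the LASER repository): run a command once
# per gene listed in a file, appending the gene name as the last argument.
run_per_gene() {
    local gene_list="$1"
    shift
    local gene
    # Read the list on file descriptor 3 so the command can still use stdin.
    while IFS= read -r gene <&3 || [ -n "${gene}" ]; do
        # Strip stray whitespace, including carriage returns.
        gene="$(printf '%s' "${gene}" | tr -d '[:space:]')"
        case "${gene}" in
            ''|'#'*) continue ;;   # skip blank lines and comments (assumption)
        esac
        "$@" "${gene}" || { echo "failed for ${gene}" >&2; return 1; }
    done 3< "${gene_list}"
}
```

For example, run_per_gene gene_list_example.txt source feature1_run_BEHIVE.sh would call the script for each listed gene and stop at the first failure. Whether the script behaves well when sourced repeatedly in one shell has not been verified here.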

After this step, the following files will be generated under data_mouse_processed/{GENE}.

  • behive_output_efficiency_evoA1.csv: This csv file contains input sequence (behive_input) and predicted editing efficiency for each sgRNA.
  • behive_output_bystander_evoA1.csv: This csv file contains input sequence (behive_input), edited outcome (Genotype), and corresponding probability scale (Predicted frequency) for each edited outcome.
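
To get a quick overview of the bystander predictions, the most frequent predicted outcome per sgRNA can be extracted with awk. The sketch below is illustrative only: top_bystander_outcome is a hypothetical name, and it assumes a plain comma-separated file with a header row that contains the column names behive_input, Genotype, and Predicted frequency exactly, with no quoted fields containing commas.

```shell
# Hypothetical helper (not part of the LASER repository): for each behive_input,
# print the Genotype with the highest "Predicted frequency".
# Assumes a simple CSV: header row, comma-separated, no quoted commas.
top_bystander_outcome() {
    awk -F',' '
        NR == 1 {
            # Locate the columns by name instead of hard-coding positions.
            for (i = 1; i <= NF; i++) {
                if ($i == "behive_input")        key = i
                if ($i == "Genotype")            gen = i
                if ($i == "Predicted frequency") val = i
            }
            if (!key || !gen || !val) {
                print "expected column names not found in header" > "/dev/stderr"
                exit 1
            }
            next
        }
        !($key in best) || $val + 0 > best[$key] {
            best[$key] = $val + 0   # numeric value used for comparison
            raw[$key]  = $val       # original text, printed unchanged
            geno[$key] = $gen
        }
        END { for (k in best) printf "%s,%s,%s\n", k, geno[k], raw[k] }
    ' "$1" | sort
}
```

Running top_bystander_outcome data_mouse_processed/Myb/behive_output_bystander_evoA1.csv prints one line per sgRNA input, sorted by input sequence.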

Parallel Processing

The additional script batch_wrapper.sh supports running scripts in parallel by submitting each job to a remote LSF server as follows. The flags may need to be adjusted for your specific environment.

source batch_wrapper.sh behive

2.4. Generating MSAs (GPU Required)

This step is designed to generate multiple sequence alignments (MSAs) of the target genes in preparation for variant effect predictions. This step is required only if using the full-capacity LASER pipeline that includes TranceptEVE; otherwise, it can be skipped.

To generate MSAs, we used the ColabFold API by running the following script.

source feature2_msa_generation.sh

After this step, the following MSA file will be generated under data_mouse_processed/{GENE}.

  • Colabfold_MSA/{GENE}.a3m.

You can use an alternative MSA generation method such as JackHMMER or MMseqs2. If you do so, make sure the format matches the example MSA file provided in this repository.
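
Whichever tool produces the MSA, a quick depth check can catch an empty or truncated alignment before variant effect prediction. The following is a rough sketch: msa_depth_ok is a hypothetical helper, it assumes each sequence in the .a3m file starts with a FASTA-style header line beginning with >, and the default minimum of 2 sequences is an arbitrary placeholder rather than a threshold recommended by LASER.

```shell
# Hypothetical helper (not part of the LASER repository): count the sequences
# in an a3m file and compare the count against a minimum depth.
msa_depth_ok() {
    local a3m="$1"
    local min="${2:-2}"   # placeholder threshold (assumption)
    local n
    if [ ! -f "${a3m}" ]; then
        echo "not found: ${a3m}" >&2
        return 1
    fi
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    n="$(grep -c '^>' "${a3m}")" || true
    echo "${a3m}: ${n} sequences"
    [ "${n}" -ge "${min}" ]
}
```

For example, msa_depth_ok data_mouse_processed/Myb/Colabfold_MSA/Myb.a3m 100 prints the sequence count and returns a non-zero status if fewer than 100 sequences are present.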

2.5. Processing data for variant effect predictions

This step is designed to process curated base editing information in preparation for the following variant effect predictions.

Run the following script feature3_processing.sh to

    1. Curate predicted base editing outcomes and efficiencies
    2. Curate resulting genotype and protein-level variants in preparation for variant effect predictions

source feature3_processing.sh

After this step, the following files will be generated under data_mouse_processed/{GENE}.

  • {GENE}_tiling_evoA1_bystanders.csv: CSV file containing information of each base editing outcome per row
  • MSA_for_EVE.fasta: Input file for training EVE
  • input_esm2_evoA1.csv: Input file for variant effect prediction using ESM2
  • input_TranceptEVE.csv: Input file for variant effect prediction using TranceptEVE
  • variants_{GENE}_evoA1_input.vcf: Input file for splicing effect prediction using Pangolin

2.6. Predicting the effects of splicing and missense mutations (GPU Required)

This step is designed to predict variant effects for splicing and missense mutations.

Run the following scripts to

    1. Train the EVE model in preparation for running TranceptEVE
    2. Run the Pangolin model to predict splicing variations upon base editing
    3. Run the ESM2 model to predict functional variations of proteins upon base editing
    4. Run TranceptEVE to predict functional variations of proteins upon base editing

source feature4_run_EVE.sh Myb
source feature4_run_Pangolin.sh Myb
source feature4_run_ESM2.sh Myb
source feature4_run_TranceptEVE.sh Myb

After this step, the following files will be generated or modified in data_mouse_processed/{GENE}.

  • trained_models_EVE: Directory containing the EVE model
  • variants_{GENE}_evoA1_pangolin.vcf: Output file containing splicing variations predicted by Pangolin
  • input_esm2_evoA1.csv: File containing variant effects predicted by ESM2 within new columns.
  • Tranception_Large_retrieval_0.6_substitutions: Directory containing variant effects predicted by TranceptEVE

In addition, external web tools, GEMME and ConSurf, were used in our LASER pipeline. Store output files in data_mouse_processed/{GENE} as follows. Ensure the filenames and directory structure match exactly, or downstream processing will fail.

  • {GENE}
    • consurf_grades.txt
    • GEMME_results
      • normPred_evolCombi.txt
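
Because these files are placed by hand, a check of their names and locations can be run before the post-processing step. This is a minimal sketch, not part of the repository: check_external_outputs is a hypothetical name, and treating an empty file as missing is an assumption made here.

```shell
# Hypothetical helper (not part of the LASER repository): verify that the
# manually stored ConSurf and GEMME outputs are in place for one gene.
check_external_outputs() {
    local gene_dir="$1"   # e.g. data_mouse_processed/Myb
    local status=0
    local rel
    for rel in consurf_grades.txt GEMME_results/normPred_evolCombi.txt; do
        # -s is true only for files that exist and are non-empty.
        if [ ! -s "${gene_dir}/${rel}" ]; then
            echo "missing or empty: ${gene_dir}/${rel}" >&2
            status=1
        fi
    done
    return "${status}"
}
```

For example, check_external_outputs data_mouse_processed/Myb returns a non-zero status and names the offending path if either file is absent or empty.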

Parallel Processing

The additional script batch_wrapper.sh supports running scripts in parallel by submitting each job to a remote LSF server as follows. The flags may need to be adjusted for your specific environment.

source batch_wrapper.sh eve
source batch_wrapper.sh pangolin
source batch_wrapper.sh esm2
source batch_wrapper.sh trancepteve

2.7. Processing predictions

This step is designed to process and summarize predicted variant effects.

Run the following script.

source feature5_postprocessing.sh

After this step, the following files will be modified under data_mouse_processed/{GENE}.

  • {GENE}_tiling_evoA1_bystanders.csv: CSV file containing annotated predictions

3. Analysis

The notebook notebooks/analysis.ipynb is provided to analyze the processed sgRNA library. This notebook is designed for analysis using experimentally validated measurements, particularly sgRNA read counts from BE screening.

4. Citations

If you use LASER in your research, please cite the following paper.

Citation information will be added upon publication.
