LASER is a pipeline for designing sgRNAs that target missense mutations introduced by split-engineered base editors, assisted by large language models (LLMs) and machine learning (ML) models. The pipeline is currently being refined to improve performance and computational time; updates will become available in a future release. Be sure to review the LICENSE before use, as some models this repository relies on have specific licensing agreements.
Our LASER pipeline combines multiple machine-learning models, some of which have conflicting Python and/or package dependencies. We therefore install these models in independent virtual environments using pyenv and poetry/pipenv and activate each environment as needed. The general requirements are as follows.
- CUDA: 11.3
- pyenv: 2.3.36
- samtools: 2.10
The LASER pipeline was developed with Python 3.8.12. It is highly recommended to use the same Python version to avoid dependency incompatibilities. Install the required Python version with pyenv as follows.
pyenv install 3.8.12
pyenv local 3.8.12
Build the virtual environment as follows. The required packages are listed in the pyproject.toml file located in the main directory of this repository and can be installed with poetry.
python3 -m pip install poetry
python3 -m poetry install
python3 -m poetry run pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
Set up Pangolin in the same virtual environment as follows.
cd ../
git clone https://github.com/tkzeng/Pangolin.git
cd laser
python3 -m poetry shell
poetry run pip install ../Pangolin
poetry run pip install setuptools==57.5.0
poetry run pip install pyvcf==0.6.8
Clone the necessary external repositories:
cd ../
git clone https://github.com/OATML-Markslab/EVE.git
git clone https://github.com/OATML-Markslab/Tranception.git
git clone https://github.com/maxwshen/be_predict_efficiency.git
git clone https://github.com/maxwshen/be_predict_bystander.git
Download the pre-computed Tranception model checkpoint.
cd ../Tranception/
mkdir model
cd model
curl -o Tranception_Large_checkpoint.zip https://marks.hms.harvard.edu/tranception/Tranception_Large_checkpoint.zip
unzip Tranception_Large_checkpoint.zip
rm Tranception_Large_checkpoint.zip
Make sure that the project directories are on the same level as follows.

```
.
├── laser
├── EVE
├── Tranception
│   └── model
│       └── Tranception_Large
├── Pangolin
├── be_predict_efficiency
└── be_predict_bystander
```
BE-HIVE models are required for the LASER pipeline, but their environment has some version conflicts with that of LASER. Install the required python version for BE-HIVE as follows using pyenv.
pyenv install 3.6.3
Build the virtual environment as follows. We provide the poetry requirement files under the poetry_files directory of this repository.
cp laser/poetry_files/* ./be_predict_efficiency/
cd be_predict_efficiency
pyenv local 3.6.3
python3 -m pip install poetry
python3 -m poetry install
This repository supports sgRNA library design for split-engineered base editors and also post-hoc analysis of those sgRNA libraries to provide the predicted effect score (LASER score) for each sgRNA.
This step is designed to collect genomic information, such as primary sequences and exon positions, for the genes of interest. Set the path where you would like to save the primary sequence and annotation files, and the target species (only mouse and human are supported in this project). It is also recommended to set these variables in your bash/zsh profile.
export DB=PATH_TO_DATABASE
export LASER_SPECIES=mouse
Run the following script download_database.sh to download primary assembly sequences and annotations of human and mouse genomes.
cd scripts
source download_database.sh
The above script will generate the following files under the directory you specified as DB.
- Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz: Human primary assembly from Ensembl
- gencode.v46.annotation.gtf.gz: Human genome annotations from GENCODE
- gencode.v46.ensembl.annotation.txt: Curated human genome annotations for Ensembl Canonical Transcripts
- Mus_musculus.GRCm39.dna.primary_assembly.fa.gz: Mouse primary assembly from Ensembl
- gencode.vM36.annotation.gtf.gz: Mouse genome annotations from GENCODE
- gencode.vM36.ensembl.annotation.txt: Curated mouse genome annotations for Ensembl Canonical Transcripts
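These are standard gzipped Ensembl/GENCODE distributions. As a rough illustration of the FASTA layout only (this helper is not part of LASER), a minimal reader for the primary-assembly files might look like:

```python
import gzip
from typing import Dict

def read_fasta(path: str) -> Dict[str, str]:
    """Parse a plain or gzipped FASTA file into {name: sequence}.

    The record name is the first whitespace-delimited token after '>',
    which is how Ensembl primary-assembly headers identify chromosomes.
    """
    opener = gzip.open if path.endswith(".gz") else open
    records: Dict[str, str] = {}
    name, chunks = None, []
    with opener(path, "rt") as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

For example, `read_fasta` applied to the mouse primary assembly under `$DB` returns a chromosome-name-to-sequence mapping.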
Run the following script feature0_preprocess.sh to

- Curate genomic information for each gene of interest listed in gene_list_example.txt (one gene per line).
- Enumerate all possible sgRNA (NGG PAM) spacer sequences overlapping with the coding region of each gene of interest listed in gene_list_example.txt.
- Curate sgRNA sequences and information for the downstream analysis.

You may customize the output directory path (OUTPUTDIR) and base editor type (BETYPE) as needed. The default paths/names are used in other scripts within the LASER pipeline, so be sure to change them there as well.
source feature0_preprocess.sh
This step will generate the following directory and files for each gene listed in gene_list_example.txt. The following example is shown for the case of mouse Myb gene with the base editor type set to evoA1.
```
data_mouse
└── Myb
    ├── ExonsSpreadsheet-Mus_musculus_Transcript_Exons_ENSMUST00000188495.8.csv
    ├── Mus_musculus_ENSMUST00000188495.8_withutr_100.fa
    └── Myb_tiling_evoA1.csv
data_mouse_processed
└── Myb
    ├── Myb.fasta
    └── Myb_tiling_evoA1.csv
```

data_mouse_processed/Myb/Myb.fasta will contain the protein sequence of the target gene, and data_mouse_processed/Myb/Myb_tiling_evoA1.csv will contain the list of sgRNA sequences along with curated genomic information.
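The spacer enumeration above is a standard SpCas9 protospacer scan. The following sketch is illustrative only, not LASER's actual implementation: it reports every 20-nt spacer whose NGG PAM lies immediately 3' of it, on both strands, with 0-based forward-strand start coordinates.

```python
# Illustrative NGG-PAM protospacer scan (not the pipeline's code).
COMP = str.maketrans("ACGT", "TGCA")

def find_spacers(seq: str, spacer_len: int = 20):
    """Return (start, strand, spacer) for every SpCas9 NGG protospacer.

    `start` is the leftmost 0-based coordinate of the protospacer on the
    forward strand; the spacer is reported as read on its own strand,
    i.e. the sgRNA spacer sequence.
    """
    seq = seq.upper()
    rc = seq.translate(COMP)[::-1]  # reverse complement
    n = len(seq)
    hits = []
    for strand, s in (("+", seq), ("-", rc)):
        for i in range(len(s) - spacer_len - 2):
            pam = s[i + spacer_len: i + spacer_len + 3]
            if pam[1:] == "GG":  # NGG
                start = i if strand == "+" else n - (i + spacer_len)
                hits.append((start, strand, s[i: i + spacer_len]))
    return hits
```

A real pipeline would additionally intersect these hits with coding-exon coordinates before keeping them.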
This step is designed to predict base editing outcomes using BE-HIVE for the sgRNA sequences curated in the prior step. Because BE-HIVE predictions take 5-10 minutes per gene, the script feature1_run_BEHIVE.sh is designed to be run once per gene.
Run the following script feature1_run_BEHIVE.sh to

- Predict the base editing efficiency for each sgRNA
- Predict the distribution of edited outcomes for each sgRNA

By default, this script assumes that both BE-HIVE git repositories are cloned in the home directory and uses evoA1. Otherwise, these paths must be specified manually within the script.
source feature1_run_BEHIVE.sh Myb
After this step, the following files will be generated under data_mouse_processed/{GENE}.
- behive_output_efficiency_evoA1.csv: This CSV file contains the input sequence (behive_input) and the predicted editing efficiency for each sgRNA.
- behive_output_bystander_evoA1.csv: This CSV file contains the input sequence (behive_input), the edited outcome (Genotype), and the corresponding probability (Predicted frequency) of each edited outcome.
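The two tables can be joined on behive_input so that each genotype carries an absolute expected frequency (per-sgRNA efficiency multiplied by the within-edit predicted frequency). The sketch below is hypothetical post-processing, not part of the pipeline, and the efficiency column name "Predicted efficiency" is an assumption based on the descriptions above.

```python
# Hypothetical join of BE-HIVE's two outputs; column names are assumptions.
import csv
import io

def expected_frequencies(efficiency_csv: str, bystander_csv: str):
    """Return (sgRNA input, genotype, efficiency * predicted frequency)."""
    eff = {row["behive_input"]: float(row["Predicted efficiency"])
           for row in csv.DictReader(io.StringIO(efficiency_csv))}
    out = []
    for row in csv.DictReader(io.StringIO(bystander_csv)):
        key = row["behive_input"]
        out.append((key, row["Genotype"],
                    eff[key] * float(row["Predicted frequency"])))
    return out
```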
Parallel Processing
The additional script batch_wrapper.sh supports running scripts in parallel by submitting each job to a remote LSF server as follows. Flags may need to be adjusted for your specific environment.
source batch_wrapper.sh behive
This step is designed to generate multiple sequence alignments (MSAs) of the target genes in preparation for variant effect predictions. This step is required only if using the full-capacity LASER pipeline that includes TranceptEVE; otherwise, it can be skipped.
To generate MSAs, we use the ColabFold API by running the following script.
source feature2_msa_generation.sh
After this step, the following MSA file will be generated under data_mouse_processed/{GENE}.
Colabfold_MSA/{GENE}.a3m.
You can use alternative MSA generation methods such as JackHMMER or MMseqs2. If you do, make sure to match the format of the example MSA file provided in this repository.
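For format checking, note that a3m files are FASTA-like, with lowercase letters marking insertions relative to the query; dropping the lowercase columns places every sequence in query coordinates. A minimal illustrative reader (not part of LASER):

```python
# Illustrative a3m reader: lowercase residues are insertions relative to
# the query and are stripped to align all sequences in query coordinates.
def read_a3m(text: str):
    names, seqs = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            names.append(line[1:].split()[0])
            seqs.append("")
        elif line.strip():
            seqs[-1] += line.strip()
    aligned = ["".join(c for c in s if not c.islower()) for s in seqs]
    return dict(zip(names, aligned))
```

After stripping insertions, every sequence should be exactly as long as the query; that is a quick way to validate an externally generated MSA.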
This step is designed to process curated base editing information in preparation for the following variant effect predictions.
Run the following script feature3_processing.sh to

- Curate predicted base editing outcomes and efficiencies
- Curate the resulting genotypes and protein-level variants in preparation for variant effect predictions
source feature3_processing.sh
After this step, the following files will be generated under data_mouse_processed/{GENE}.
- {GENE}_tiling_evoA1_bystanders.csv: CSV file containing information on each base editing outcome (one per row)
- MSA_for_EVE.fasta: Input file for training EVE
- input_esm2_evoA1.csv: Input file for variant effect prediction using ESM2
- input_TranceptEVE.csv: Input file for variant effect prediction using TranceptEVE
- variants_{GENE}_evoA1_input.vcf: Input file for splicing effect prediction using Pangolin
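The Pangolin input follows the standard VCF layout: tab-separated CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO columns with 1-based positions. A hypothetical sketch of writing such records (the coordinates below are invented for illustration):

```python
# Minimal VCF v4.2 record writer (illustrative, not LASER code).
VCF_HEADER = ("##fileformat=VCFv4.2\n"
              "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")

def vcf_record(chrom, pos, ref, alt, var_id="."):
    """One VCF body line; POS is 1-based, missing fields are '.'."""
    return f"{chrom}\t{pos}\t{var_id}\t{ref}\t{alt}\t.\t.\t.\n"
```

A per-gene input file would then be the header followed by one record per predicted base edit.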
This step is designed to predict variant effects for splicing and missense mutations.
Run the following scripts to

- Train the EVE model in preparation for running TranceptEVE
- Run the Pangolin model to predict splicing variations upon base editing
- Run the ESM2 model to predict functional variations of proteins upon base editing
- Run TranceptEVE to predict functional variations of proteins upon base editing
source feature4_run_EVE.sh Myb
source feature4_run_Pangolin.sh Myb
source feature4_run_ESM2.sh Myb
source feature4_run_TranceptEVE.sh Myb
After this step, the following files will be generated or modified in data_mouse_processed/{GENE}.
- trained_models_EVE: Directory containing the trained EVE model
- variants_{GENE}_evoA1_pangolin.vcf: Output file containing splicing variations predicted by Pangolin
- input_esm2_evoA1.csv: File containing variant effects predicted by ESM2 in new columns
- Tranception_Large_retrieval_0.6_substitutions: Directory containing variant effects predicted by TranceptEVE
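Protein-level predictors such as ESM2 and TranceptEVE score missense variants commonly written in the {wt}{1-based position}{mut} form (e.g. A123T). As an illustrative helper (not LASER code), applying such a variant to a protein sequence with a reference check might look like:

```python
# Illustrative missense-variant application with a wild-type sanity check.
import re

def apply_missense(protein: str, variant: str) -> str:
    """Apply a variant like 'A123T' (1-based) to a protein sequence."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if not m:
        raise ValueError(f"unparsable variant: {variant}")
    wt, pos, mut = m.group(1), int(m.group(2)), m.group(3)
    if protein[pos - 1] != wt:
        raise ValueError(f"reference mismatch at {pos}: "
                         f"expected {wt}, found {protein[pos - 1]}")
    return protein[:pos - 1] + mut + protein[pos:]
```

The reference check catches off-by-one coordinate errors early, which is useful when moving variants between genomic and protein coordinates.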
In addition, the external web tools GEMME and ConSurf are used in our LASER pipeline. Store their output files in data_mouse_processed/{GENE} as follows. Ensure the filenames and directory structure match exactly, or downstream processing will fail.
```
{GENE}
├── consurf_grades.txt
└── GEMME_results
    └── normPred_evolCombi.txt
```
Parallel Processing
The additional script batch_wrapper.sh supports running scripts in parallel by submitting each job to a remote LSF server as follows. Flags may need to be adjusted for your specific environment.
source batch_wrapper.sh eve
source batch_wrapper.sh pangolin
source batch_wrapper.sh esm2
source batch_wrapper.sh trancepteve
This step is designed to process and summarize predicted variant effects.
Run the following script.
source feature5_postprocessing.sh
After this step, the following files will be modified under data_mouse_processed/{GENE}.
{GENE}_tiling_evoA1_bystanders.csv: CSV file containing annotated predictions
The notebook notebooks/analysis.ipynb is provided to analyze the processed sgRNA library. This notebook is designed for analysis using experimentally validated measurements, particularly sgRNA read counts from base editing screening.
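As one generic example of the kind of read-count summary used in base editing screens (the notebook's exact analysis may differ), counts can be normalized to counts-per-million and compared between two conditions with a pseudocount-stabilized log2 fold change per sgRNA:

```python
# Generic sgRNA-screen summary (illustrative; not the notebook's code).
import math

def log2_fold_changes(counts_t0, counts_t1, pseudocount=1.0):
    """Per-sgRNA log2 fold change of CPM-normalized read counts."""
    tot0 = sum(counts_t0.values())
    tot1 = sum(counts_t1.values())
    lfc = {}
    for guide in counts_t0:
        cpm0 = 1e6 * counts_t0[guide] / tot0
        cpm1 = 1e6 * counts_t1.get(guide, 0) / tot1
        # pseudocount keeps dropout guides (zero reads) finite
        lfc[guide] = math.log2((cpm1 + pseudocount) / (cpm0 + pseudocount))
    return lfc
```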
If you used LASER in your research, please cite the following paper.
Citation information will be added upon publication.