apinto17/Crispr-Screening-Simulator

Overview

This pipeline finds the most abundant genes containing motifs that may be associated with gamma-ray resistance. The motifs are listed in motif_list.txt, and the relevant clinical data is in clinical_data.txt.

The pipeline identifies the most abundant genes, then simulates the pre- and post-knockout state of each one, using a suitable CRISPR site for the knockout. The results are then compiled into a report.

TO RUN:

Prerequisites

  • Make each script executable with chmod +x filename
  • Make sure the following files are in the directory you run the scripts from:
    • clinical_data.txt
    • motif_list.txt
    • exomes/*.fasta
  • Make sure Python is installed

Run in this order:

copyExomes.sh

Reads the clinical data file and identifies the samples that have a diameter between 20 and 30 mm (inclusive) and have had their genomes sequenced. It then copies the exomes of those samples, matched by sample code name, to a new directory called exomesCohort.

This is the only script that requires a parameter. Run it like this:

./copyExomes.sh clinical_data.txt
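The selection rule can be sketched in Python as below. This is only an illustration of the filter, not the contents of copyExomes.sh. The layout of clinical_data.txt is not described here, so the tab-separated format and the column names code_name, diameter_mm and sequenced are hypothetical, as is the sample data.

```python
import csv
import io


def select_cohort(rows, lo=20.0, hi=30.0):
    """Return code names of sequenced samples with lo <= diameter <= hi."""
    selected = []
    for row in rows:
        diameter = float(row["diameter_mm"])                  # hypothetical column
        is_sequenced = row["sequenced"].strip().lower() == "yes"  # hypothetical column
        if is_sequenced and lo <= diameter <= hi:
            selected.append(row["code_name"])                 # hypothetical column
    return selected


# Made-up sample data standing in for clinical_data.txt.
SAMPLE = (
    "code_name\tdiameter_mm\tsequenced\n"
    "fox\t20\tyes\n"
    "bear\t31\tyes\n"
    "wolf\t25\tno\n"
    "chicken\t30\tyes\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
cohort = select_cohort(rows)
print(cohort)
```

Both bounds are inclusive, so samples measuring exactly 20 mm or 30 mm are kept. Each selected code name would then correspond to one file copied from exomes/ into exomesCohort.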

createCrisprReady.sh

Using the motif_list.txt file, this script identifies the 3 highest occurring motifs in each exome inside the exomesCohort folder. It writes the headers and corresponding sequences to a new file called {exomename}_topmotifs.fasta.

./createCrisprReady.sh
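One way to read this step is sketched below in Python: count each motif across an exome, keep the three most frequent, then keep the genes that contain at least one of them. The function names, the sample records and the choice to count non-overlapping matches are assumptions made for illustration; the shell script may count differently.

```python
from collections import Counter


def top_motifs(sequences, motifs, n=3):
    """Return the n motifs with the highest total occurrence count."""
    counts = Counter()
    for motif in motifs:
        # str.count counts non-overlapping occurrences (an assumption here).
        counts[motif] = sum(seq.count(motif) for seq in sequences)
    return [motif for motif, _ in counts.most_common(n)]


def genes_with_motifs(records, motifs):
    """Keep (header, sequence) pairs whose sequence contains any of the motifs."""
    return [(h, s) for h, s in records if any(m in s for m in motifs)]


# Made-up exome records and motif list.
records = [
    ("gene1", "AAACCCAAA"),
    ("gene2", "GGGAAACCC"),
    ("gene3", "AAAGGGCCC"),
    ("gene4", "TTTACGTAC"),
]
motifs = ["AAA", "CCC", "GGG", "TTT"]

best = top_motifs([seq for _, seq in records], motifs)
kept = genes_with_motifs(records, best)
print(best)
print([header for header, _ in kept])
```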

identifyCrisprSite.sh

For each gene inside the {exomename}_topmotifs.fasta files, this script identifies a suitable CRISPR site. A suitable site is an "NGG" sequence, where "N" can be any base, with at least 20 base pairs upstream of it.

Example of upstream: ATGAACGTCTGTAAGAACTGCGGATCTGTCA (everything to the left of CGG is upstream; here that is exactly 20 bases).

Suitable candidates (headers and sequences) are written to a new file called {exomename}_precrispr.fasta.

./identifyCrisprSite.sh
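The site rule can be sketched in Python as follows. The function name find_crispr_sites is hypothetical and this is not the script's own implementation; it only shows the condition "NGG with at least 20 bases upstream" applied to the example sequence above.

```python
def find_crispr_sites(seq, min_upstream=20):
    """Return the index of N for every NGG that has >= min_upstream bases before it."""
    seq = seq.upper()
    return [
        i
        for i in range(min_upstream, len(seq) - 2)
        if seq[i + 1:i + 3] == "GG"  # N at position i can be any base
    ]


example = "ATGAACGTCTGTAAGAACTGCGGATCTGTCA"
sites = find_crispr_sites(example)
print(sites)
```

Starting the scan at index 20 enforces the upstream requirement directly, because an N at index i has exactly i bases to its left.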

editGenome.sh

Using the {exomename}_precrispr.fasta files, this script inserts the letter A right before the NGG site and writes the result to a new file called {exomename}_postcrispr.fasta. This simulates a single successful CRISPR edit.

./editGenome.sh
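A minimal Python sketch of the edit is shown below. It assumes that only the first suitable site in each sequence is edited, which the description does not state, and the function name insert_a_before_site is hypothetical.

```python
def insert_a_before_site(seq, min_upstream=20):
    """Insert 'A' immediately before the first NGG with enough upstream bases."""
    for i in range(min_upstream, len(seq) - 2):
        if seq[i + 1:i + 3] == "GG":
            # Assumption: edit only the first suitable site, then stop.
            return seq[:i] + "A" + seq[i:]
    return seq  # no suitable site: sequence is returned unchanged


pre = "ATGAACGTCTGTAAGAACTGCGGATCTGTCA"
post = insert_a_before_site(pre)
print(post)
```

The edited sequence is one base longer than the input, with the inserted A sitting between the 20 upstream bases and CGG.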

exomeReport.py

This Python script generates a single report that summarizes the findings. The report is a text file that lists the name of the discoverer of the organism, the diameter, the code name, and the environment it came from. The next sentence states where the file can be found on the server, and the report then prints the first FASTA block of that file (i.e. just the first header and sequence).

Organism CODENAME, discovered by DISCOVERER, has a diameter of DIAMETER, and from the environment ENVIRONMENT.

The list of genes can be found in: ./some_path_crispr/codename_postcrispr.fasta

The first sequence of CODENAME is:

Gene0123

ATACGTACGGATCTATTT

python exomeReport.py
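The report template above could be filled in along these lines. This is a sketch rather than the code in exomeReport.py: the function names are hypothetical, the sample values are made up, and it assumes standard FASTA headers that begin with ">".

```python
def first_fasta_block(lines):
    """Return (header, sequence) for the first record in an iterable of FASTA lines."""
    header, seq_parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # second header reached, so the first block is complete
            header = line[1:]
        elif header is not None:
            seq_parts.append(line)
    return header, "".join(seq_parts)


def format_report(codename, discoverer, diameter, environment, fasta_path, header, sequence):
    """Fill in the report template for one organism."""
    return "\n".join([
        f"Organism {codename}, discovered by {discoverer}, has a diameter of {diameter}, "
        f"and from the environment {environment}.",
        f"The list of genes can be found in: {fasta_path}",
        f"The first sequence of {codename} is:",
        header,
        sequence,
    ])


# Made-up FASTA content and clinical values.
fasta_lines = [">Gene0123", "ATACGTACG", "GATCTATTT", ">Gene0456", "CCC"]
header, sequence = first_fasta_block(fasta_lines)
report = format_report("fox", "Ada", 25, "forest", "./fox_postcrispr.fasta", header, sequence)
print(report)
```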
