53 commits
3e9a1ff
update doc layout and content
nathanielmki Jan 7, 2021
4dc1e79
update docs
nathanielmki Jan 11, 2021
84084fb
update ncbi doc
nathanielmki Jan 11, 2021
b73cc5a
formatting
nathanielmki Jan 11, 2021
61c5ecf
update summary and objectives
nathanielmki Jan 11, 2021
361fde9
formatting
nathanielmki Jan 11, 2021
6495200
update text
nathanielmki Jan 11, 2021
76179b8
update text
nathanielmki Jan 11, 2021
6cebdc3
update text
nathanielmki Jan 11, 2021
91daffd
update text
nathanielmki Jan 11, 2021
64999ed
spelling
nathanielmki Jan 11, 2021
c39eca0
text update
nathanielmki Jan 11, 2021
bf89dd6
add new ssh config doc
nathanielmki Jan 12, 2021
8b02399
update text
nathanielmki Jan 12, 2021
c4e151e
update content
nathanielmki Jan 12, 2021
b7e3417
update images
nathanielmki Jan 12, 2021
43f4348
update docs
nathanielmki Jan 22, 2021
79a8405
update wsl doc
nathanielmki Jan 22, 2021
d8c833f
update images, doc
nathanielmki Jan 26, 2021
965b5f9
update doc
nathanielmki Jan 27, 2021
12184e1
update formating
nathanielmki Jan 27, 2021
c7f42e8
update databases contetn
nathanielmki Feb 4, 2021
78b01d2
update wsl doc
nathanielmki Feb 8, 2021
5e75531
update docs
nathanielmki Feb 9, 2021
49c50dd
fix image paths
nathanielmki Mar 7, 2021
c91c623
add images, worksheet doc, (testing emoji embed :P)
nathanielmki Mar 7, 2021
4f16e44
add worksheet content
nathanielmki Mar 7, 2021
a9d3ec9
update assembly worksheet, add exploration doc
nathanielmki Mar 8, 2021
aae229f
update Step 4 with additional details
nathanielmki Mar 8, 2021
18b3192
complete assembly workshett, update exploration
nathanielmki Mar 8, 2021
bbaefc9
update language on assembly worksheet, update exploration
nathanielmki Mar 8, 2021
6129316
update exploration doc with missing options in JSON
nathanielmki Mar 8, 2021
a00cbaf
formatting
nathanielmki Mar 8, 2021
9e91caa
more formatting
nathanielmki Mar 8, 2021
c02fe5a
update doc
nathanielmki Mar 8, 2021
db74f5c
update worksheet docs
nathanielmki Mar 9, 2021
f48b738
update doc with reduced templates
nathanielmki Mar 10, 2021
72b0487
update Assembly doc with new paths
nathanielmki Mar 16, 2021
ff337d9
add new worksheet doc
nathanielmki Mar 30, 2021
d79caad
update doc
nathanielmki Mar 30, 2021
85fa2b1
update doc some more
nathanielmki Mar 30, 2021
4a71f80
add worksheet
nathanielmki Apr 5, 2021
af28b5b
update .DS_store
nathanielmki Jun 24, 2021
d5baf44
add new task list doc
nathanielmki Sep 16, 2021
40903cd
update task list doc with additional details
nathanielHeila Sep 26, 2021
e98b53d
update doc module 1
nathanielHeila Sep 26, 2021
1e2ebf3
update biocore_doc with cli content
nathanielHeila Oct 27, 2021
e640ff5
update cli exercises doc, add script file
nathanielHeila Nov 1, 2021
b4a7c6e
update docs
nathanielHeila Nov 1, 2021
28db9a2
reword SSH portion of cli orientation
nathanielHeila Nov 1, 2021
7ea2531
update orientation with tmux content
nathanielHeila Nov 1, 2021
0103d31
add day3 script
nathanielHeila Nov 3, 2021
a8b9b84
update script
nathanielHeila Nov 3, 2021
Binary file modified .DS_Store
Binary file not shown.
Binary file renamed minota/.DS_Store → bioinformatics/.DS_Store
Binary file not shown.
Binary file added bioinformatics/analysis/.DS_Store
Binary file not shown.
@@ -1,6 +1,7 @@
# MINOTA Workshop: Introduction to Quality Control
# Introduction to Quality Control

In this tutorial, we'll be covering the critical process of Quality Control.

Welcome to the MDIBL MINOTA Workshop. In this portion of the course, we'll be covering the critical process of Quality Control.
First, a review of some exploratory tools for both pre and post transcriptome analysis, and how they can provide you with an overview of your input data and output results, without having to sift through pages of text log files (as fun as that sounds).

Next, we're taking a look at a couple of software packages that do the heavy lifting, and are chiefly responsible for the Quality Control aspects of QC. Specifically Trimmomatic, in an integrated context within Trinity, and Trim Galore!; a wonderful piece of software built by the developers of FastQC (which is, coincidentally, half of it!).
@@ -97,7 +98,7 @@ First, you're going to want to fire up your favorite terminal, and ssh into your

It'll look something like this:

<img src="./../images/quality-control/ssh_1.png" width="800">
<img src="./intro_to_quality_control_img/ssh_1.png" width="800">

### QC Workflow Guided Run

@@ -121,7 +122,7 @@ Delete the text after `path:` under `seqfile:`.

Fill the empty `path:` with the file path we `ls`'d earlier:

<img src="./../images/quality-control/qc_workflow_edit.png" width="800">
<img src="./intro_to_quality_control_img/qc_workflow_edit.png" width="800">

**If you botch or delete something, and you don't remember what it was: in `nano`, use `control + x` on macOS / `ctrl + x` on Windows, followed by `n` and `enter`.**

@@ -131,7 +132,7 @@ To save your changes, use `control + o` and `enter`.

To execute the workflow in CWL, simply type the following on the command line:

<img src="./../images/quality-control/qc_workflow_run.png" width="800">
<img src="./intro_to_quality_control_img/qc_workflow_run.png" width="800">

Depending on the file you chose, it should process fairly quickly, depositing six named files in your directory: an HTML file, a zipped folder containing the raw contents of the initial FastQC report, a trimmed read file, a trimmed HTML file + zipped raw contents, and a trimming report text file.

@@ -143,15 +144,15 @@ For those of you may not have worked with FastQC reports before, I'll be going o

#### Basic Statistics

<img src="./../images/quality-control/basic_stats.png" width="800">
<img src="./intro_to_quality_control_img/basic_stats.png" width="800">

* Summary statistics of your input file
* File type, encoding, total sequence, sequence quality, length, and %GC
* Why does the sequence length vary between untrimmed/trimmed? Most likely due to adapters being removed, shortening the sequence and introducing variation

#### Per Base Sequence Quality

<img src="./../images/quality-control/per_base_seq_qual.png" width="800">
<img src="./intro_to_quality_control_img/per_base_seq_qual.png" width="800">

* BoxWhisker plot made up of
* central red line being the median value
@@ -166,23 +167,23 @@ For those of you may not have worked with FastQC reports before, I'll be going o

#### Per Tile Sequence Quality

<img src="./../images/quality-control/per_tile_seq_qual.png" width="800">
<img src="./intro_to_quality_control_img/per_tile_seq_qual.png" width="800">

* Shows deviations from average quality for each tile
* Blue = positions where quality was at or above average for that base in the run
* Red = worse quality

#### Per Sequence Quality Scores

<img src="./../images/quality-control/per_seq_qual_score.png" width="800">
<img src="./intro_to_quality_control_img/per_seq_qual_score.png" width="800">

* Shows you if a subset of your sequences contain universally low quality values
* If a large amount of sequences in a run have an overall low quality, may point to a systemic problem (either entirely, or a portion) with the run itself
* Errors will arise when a general loss of quality within a run is encountered
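
The per-sequence view boils down to one mean quality value per read. Here is a minimal sketch of that idea, assuming Phred+33 encoded quality strings (the fourth line of each FASTQ record); the function names and the threshold of 20 are illustrative choices, not FastQC's own code or cutoffs:

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Mean Phred score across one read; 0.0 for an empty read."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores) if scores else 0.0


def flag_low_quality(quality_lines, threshold=20.0):
    """Indices of reads whose mean quality falls below the (assumed) threshold."""
    return [i for i, q in enumerate(quality_lines) if mean_quality(q) < threshold]
```

If a large share of the reads end up flagged, that matches the "systemic problem" case described above.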

#### Per Base Sequence Content

<img src="./../images/quality-control/per_base_seq_con.png" width="800">
<img src="./intro_to_quality_control_img/per_base_seq_con.png" width="800">

* Plots out proportion of each base position in a file, where normal DNA bases are called
* There should be little to no difference, and the lines should run close to one another (should not be massively imbalanced)
@@ -196,29 +197,29 @@ For those of you may not have worked with FastQC reports before, I'll be going o

#### Per Sequence GC Content

<img src="./../images/quality-control/per_seq_gc_con.png" width="800">
<img src="./intro_to_quality_control_img/per_seq_gc_con.png" width="800">

* Measures GC content across entire length of every sequence, compares to a normal distribution model of GC content
* Sharp peaks in the measured distribution may indicate a contaminated library
* Distribution shift may point to an existing systemic bias (independent of base position)
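
To make the measurement concrete, here is a small sketch of how a per-read GC percentage and its distribution could be computed. The helper names are hypothetical, and the rounding to whole percentages is an assumption for illustration; FastQC's own binning and its fitted normal model are not reproduced here:

```python
from collections import Counter


def gc_percent(seq):
    """Percentage of G and C bases across the full length of one read."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def gc_histogram(seqs):
    """Count reads per rounded GC percentage (the curve that gets plotted)."""
    return Counter(round(gc_percent(s)) for s in seqs)
```

A sharp spike in one bin of `gc_histogram` is the kind of feature that would show up as a peak in the plot.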

#### Per Base N Content

<img src="./../images/quality-control/per_base_N_con.png" width="800">
<img src="./intro_to_quality_control_img/per_base_N_con.png" width="800">

* When a base call is unable to be made by a sequencer, an N is put in place of the normal base
* If a significant portion of the base calls at a position are N, it suggests that the pipeline was not able to make valid base calls
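
As a rough illustration of what this module counts, the sketch below computes the percentage of N calls at each read position. The function name is hypothetical, and reads of different lengths are handled by counting only the reads that reach a given position, which is an assumption rather than a documented FastQC detail:

```python
def per_base_n_percent(seqs):
    """Percentage of reads with an N at each position (0-based)."""
    max_len = max((len(s) for s in seqs), default=0)
    n_counts = [0] * max_len
    totals = [0] * max_len
    for s in seqs:
        for i, base in enumerate(s.upper()):
            totals[i] += 1
            if base == "N":
                n_counts[i] += 1
    # Every position below max_len is covered by at least one read.
    return [100.0 * n / t for n, t in zip(n_counts, totals)]
```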

#### Sequence Length Distribution

<img src="./../images/quality-control/seq_len_dist.png" width="800">
<img src="./intro_to_quality_control_img/seq_len_dist.png" width="800">

* Graphs the distribution of fragment sizes in the sequence file analyzed
* Some sequencers generate fragments with uniform lengths; even so, after trimming the uniformity will be broken and variation in length introduced

#### Sequence Duplication Levels

<img src="./../images/quality-control/seq_dup_lev.png" width="800">
<img src="./intro_to_quality_control_img/seq_dup_lev.png" width="800">

* When working with a diverse library, most sequences should occur only once in the final set
* low levels of duplication can point to a high amount of coverage of the target sequence
@@ -235,14 +236,14 @@ For those of you may not have worked with FastQC reports before, I'll be going o

#### Overrepresented Sequences

<img src="./../images/quality-control/ovr_rep_seq.png" width="800">
<img src="./intro_to_quality_control_img/ovr_rep_seq.png" width="800">

* Lists all sequences making up more than 0.1% of the total, though to conserve memory, only ones that appear in the first 100,000 sequences are tracked
* For every overrepresented sequence, FastQC looks for matches in a database of common contaminants, reporting best found hits
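
The tracking rule above can be sketched as follows. This is a simplified illustration, not FastQC's implementation: the function name and parameters are hypothetical, and the contaminant database lookup is left out entirely:

```python
from collections import Counter


def overrepresented(seqs, track_limit=100_000, min_percent=0.1):
    """Return (sequence, count, percent) for sequences above min_percent of all reads.

    Only sequences first seen within the first `track_limit` reads are counted,
    which keeps memory bounded on large files.
    """
    counts = Counter()
    total = 0
    for seq in seqs:
        total += 1
        # Keep counting known sequences; admit new ones only early in the file.
        if seq in counts or total <= track_limit:
            counts[seq] += 1
    if total == 0:
        return []
    hits = [
        (s, c, 100.0 * c / total)
        for s, c in counts.items()
        if 100.0 * c / total > min_percent
    ]
    return sorted(hits, key=lambda h: h[1], reverse=True)
```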

### Adapter Content

<img src="./../images/quality-control/adp_con.png" width="800">
<img src="./intro_to_quality_control_img/adp_con.png" width="800">

* Picks up whether your library has a significant amount of adapter sequence, and whether you need to conduct trimming.
* Plot shows cumulative percentage count of the proportion of the library that has seen each adapter sequence at each position
32 changes: 32 additions & 0 deletions bioinformatics/online_resources/intro_to_databases.md
@@ -0,0 +1,32 @@
---
title: Introduction to Databases
author: "Nathaniel Maki"
organization: MDIBL Computational Core
date: "January 20th"
---

# Introduction to Databases

## Learning Objectives

* Learn the differences between Primary and Secondary databases
* Exposure to the wide range of databases available for exploration
* Become familiar with standard use cases for a selection of sites covered

## Summary

Commonly, databases are characterized as either primary or secondary, and this holds true for bioinformatics as it does for other data-rich fields

**Primary databases** are comprised of data that has been experimentally derived, with the results being uploaded directly into the database by researchers

* In our domain for example, the information that is archived is made up of content such as nucleotide or protein sequence, or macromolecular structure
* Once assigned an accession number, the data stored within a primary database becomes static, and is designated a Record
* Ex: GenBank, ENA, GEO

**Secondary databases** could be considered an "extension" of primary ones, due to their makeup being derived from the analysis of primary data

* Pull from multiple sources, such as other databases, and available scientific literature
* These resources are incredibly complex, combining manual and computational analysis/interpretation, and are very highly curated
* Their primary purpose is to exist as vast repositories of reference material, with detailed data ranging from single genes, to complete and published experimental results
* Ex: UniProt, Ensembl, InterPro

113 changes: 113 additions & 0 deletions bioinformatics/online_resources/intro_to_ensembl.md
@@ -0,0 +1,113 @@
---
title: Introduction to Ensembl
author: "Nathaniel Maki"
organization: MDIBL Computational Core
date: "January 24th"
---

# Introduction to Ensembl

## Summary

* Ensembl is a genome browser, acting as a vast repository of reference genomes and annotations for a wide range of organisms, including Human, Mouse, C. elegans, and Zebrafish
* Mostly dedicated to model organisms, but does contain resources for a number of non-model species
* Primarily focused on vertebrates, Ensembl Genomes extends across to non-vertebrates, and includes Plants, Fungi, and Bacteria

Ensembl annotates a large swath of data onto its genome assemblies. The first type is Gene Models (builds).

## Gene Models

Comprised of:

International Nucleotide Sequence databases (ENA, GenBank, DDBJ)
* cDNAs
* ESTs
* RNAseq

NCBI RefSeq
* Manually annotated proteins and mRNAs

Protein Sequence databases
* Swiss-Prot

Sequences from the above resources are aligned to the genome, and transcripts are clustered from the alignments based on overlapping coding sequences

These clusters form Ensembl genes (automated genome annotation pipeline)

Ensembl genomes can be annotated either automatically or manually (HAVANA for manual)
* This set of genes is known as the GENCODE gene set

In addition to gene annotation, other data types are added to the genome, including variation data, comparative genomics, and regulatory features (which we'll touch on later)

## Querying

Choose the human genome build, and search for `tp53`

* links on the left of the page show specific information related to the TP53 gene
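
The same lookup can also be done programmatically through the Ensembl REST service rather than the browser. The sketch below assumes the public `lookup/symbol` endpoint at `rest.ensembl.org`; the helper names are hypothetical, and the exact fields in the JSON response should be checked against the REST documentation rather than taken from this example:

```python
import json
import urllib.request
from urllib.parse import quote

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build the REST URL for a gene-symbol lookup in the given species."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def lookup_gene(symbol, species="homo_sapiens", timeout=30):
    """Fetch the lookup record as a dict (requires network access)."""
    with urllib.request.urlopen(lookup_url(symbol, species), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `lookup_gene("TP53")` should return a dictionary describing the gene, which can then be compared with what the gene page shows.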

### Summary

* Gene has 27 transcripts annotated, 312 orthologues, 2 paralogues
* `Show transcript table` gives us detailed information regarding the Gene and its associated transcripts
* Transcript ID
* Biotype
* CCDS (Consensus Coding DNA seq set)
* UniProt Match - Link to Protein transcript entry

#### Gene Track

* Blue bar = Contigs (sequence of overlapping reads)
* Transcripts above contig are on the forward strand, below it they're on the reverse
* Boxes are exons, lines which connect them are the introns
* Filled in boxes contain coding sequence, unfilled represent untranslated regions
* Red = Ensembl Protein coding (annotated by Ensembl automated)
* Gold = merged Ensembl/Havana (annotated by Ensembl automated + Havana manual annotation)
* Blue = Processed Transcript
* Regulation
* Dark Salmon = Promoter
* Light Salmon = Promoter Flank
* Pink = Transcription Factor Binding Site
* Cyan = CTCF

* Selecting a Transcript
* Click the box of your choosing and select the transcript ID
* Can examine supporting evidence
* Protein Information (reference UniProt)

* Region in detail
* Selection Location
* Top of page is the chromosomal overview; the red box denotes the region of the chromosome that the other views on the page focus on
* Red box in Detail highlights a 1 Mb overview of TP53
* Scrolling further down shows most detailed location of TP53
* Tracks can be formatted, added, removed, etc
* Gear icon (configure) lets you add additional tracks

### Comparative Genomics

Allows you to compare a gene against multiple alignments, Gene Trees, Orthologues, and Paralogues
* Gene Ortholog - homologous genes that diverged when evolution gave rise to new species; they tend to maintain a similar function to the precursor gene
* Originate from speciation event
* Gene Paralog - homologous genes that diverged within a species, a new gene that upholds a new function
* Come into existence during gene duplication, where a copy of the gene obtains a mutation -> new gene with new function

#### Alignments

* Pairwise - meaning two sequences at a time
* Multiple - more than two (attempt to align all sequences within a query set)
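
To show what "two sequences at a time" means in practice, here is a minimal global alignment scorer in the style of Needleman-Wunsch. It is only an illustration of the pairwise idea; the scoring values are arbitrary assumptions, and this is not the method or tooling Ensembl uses to build its alignments:

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b (dynamic programming)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

A multiple alignment extends this idea to more than two sequences, which is considerably more expensive and is usually solved with heuristics.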

Can choose many sequences to potentially align to
* Examine full map to see areas of similar sequence
* Also look at high quality assemblies compared to low quality

#### Gene Tree
* Relation of gene between species, includes homologs

#### Orthologues
* Lists gene orthologues

#### Paralogues
* Lists gene paralogues



