mdibl · nathanielmki · Jan 7, 2021 · Jan 11, 2021 · Jan 11, 2021 · Jan 11, 2021
diff --git a/.DS_Store b/.DS_Store
diff --git a/minota/.DS_Store → bioinformatics/.DS_Store b/minota/.DS_Store → bioinformatics/.DS_Store
diff --git a/bioinformatics/analysis/.DS_Store b/bioinformatics/analysis/.DS_Store
diff --git a/minota/markdown/minota-quality-control.md → ...tics/analysis/intro_to_quality_control.md b/minota/markdown/minota-quality-control.md → ...tics/analysis/intro_to_quality_control.md
@@ -1,6 +1,7 @@
-# MINOTA Workshop: Introduction to Quality Control
+# Introduction to Quality Control
+
+In this tutorial, we'll be covering the critical process of Quality Control.
 
-Welcome to the MDIBL MINOTA Workshop. In this portion of the course, we'll be covering the critical process of Quality Control.
 First, a review of some exploratory tools for both pre and post transcriptome analysis, and how they can provide you with an overview of your input data and output results, without having to sift through pages of text log files (as fun as that sounds).
 
 Next, we're taking a look at a couple of software packages that do the heavy lifting of Quality Control: Trimmomatic, in an integrated context within Trinity, and Trim Galore!, a wonderful piece of software built by the developers of FastQC (which is, coincidentally, half of it!).
@@ -97,7 +98,7 @@ First, you're going to want to fire up your favorite terminal, and ssh into your
 
 It'll look something like this:
 
-<img src="./../images/quality-control/ssh_1.png" width="800">
+<img src="./intro_to_quality_control_img/ssh_1.png" width="800">
 
 ### QC Workflow Guided Run
 
@@ -121,7 +122,7 @@ Delete the text after `path:` under `seqfile:`.
 
 Fill the empty `path:` with the file path we `ls`'d earlier: 
 
-<img src="./../images/quality-control/qc_workflow_edit.png" width="800">
+<img src="./intro_to_quality_control_img/qc_workflow_edit.png" width="800">
 
 **If you botch or delete something, and you don't remember what it was: in `nano`, use `control + x` on macOS / `ctrl + x` on Windows, followed by `n` and `enter`.**
 
@@ -131,7 +132,7 @@ To save your changes, use `control + o` and `enter`.
 
 To execute the workflow in CWL, simply type the following on the command line:
 
-<img src="./../images/quality-control/qc_workflow_run.png" width="800">
+<img src="./intro_to_quality_control_img/qc_workflow_run.png" width="800">
 
 Depending on the file you chose, it should process fairly quickly, depositing six named files in your directory: an HTML file, a zipped folder containing the raw contents of the initial FastQC report, a trimmed read file, a trimmed HTML file plus its zipped raw contents, and a trimming report text file.
 
@@ -143,15 +144,15 @@ For those of you may not have worked with FastQC reports before, I'll be going o
 
 #### Basic Statistics
 
-<img src="./../images/quality-control/basic_stats.png" width="800">
+<img src="./intro_to_quality_control_img/basic_stats.png" width="800">
 
 * Summary statistics of your input file
 * File type, encoding, total sequence, sequence quality, length, and %GC
   * Why does the sequence length vary between untrimmed/trimmed? Most likely due to adapters being removed, shortening the sequence and introducing variation
 
 #### Per Base Sequence Quality
 
-<img src="./../images/quality-control/per_base_seq_qual.png" width="800">
+<img src="./intro_to_quality_control_img/per_base_seq_qual.png" width="800">
 
 * Box-and-whisker plot made up of
   * central red line being the median value
@@ -166,23 +167,23 @@ For those of you may not have worked with FastQC reports before, I'll be going o
 
 #### Per Tile Sequence Quality
 
-<img src="./../images/quality-control/per_tile_seq_qual.png" width="800">
+<img src="./intro_to_quality_control_img/per_tile_seq_qual.png" width="800">
 
 * Shows deviations from average quality for each tile
 * Blue = positions where quality was at or above average for that base in the run
 * Red = worse quality
 
 #### Per Sequence Quality Scores
 
-<img src="./../images/quality-control/per_seq_qual_score.png" width="800">
+<img src="./intro_to_quality_control_img/per_seq_qual_score.png" width="800">
 
 * Shows you if a subset of your sequences contain universally low quality values
 * If a large number of sequences in a run have an overall low quality, this may point to a systemic problem with the run itself (affecting either the whole run or a portion of it)
 * Errors in this module usually indicate a general loss of quality within a run
 
 #### Per Base Sequence Content
 
-<img src="./../images/quality-control/per_base_seq_con.png" width="800">
+<img src="./intro_to_quality_control_img/per_base_seq_con.png" width="800">
 
 * Plots, for each base position in a file, the proportion of each of the four normal DNA bases called
 * There should be little to no difference, and the lines should run close to one another (should not be massively imbalanced)
@@ -196,29 +197,29 @@ For those of you may not have worked with FastQC reports before, I'll be going o
 
 #### Per Sequence GC Content
 
-<img src="./../images/quality-control/per_seq_gc_con.png" width="800">
+<img src="./intro_to_quality_control_img/per_seq_gc_con.png" width="800">
 
 * Measures GC content across entire length of every sequence, compares to a normal distribution model of GC content
 * Sharp peaks in the measured distribution may indicate a contaminated library
 * Distribution shift may point to an existing systemic bias (independent of base position)
 
 #### Per Base N Content
 
-<img src="./../images/quality-control/per_base_N_con.png" width="800">
+<img src="./intro_to_quality_control_img/per_base_N_con.png" width="800">
 
 * When a base call is unable to be made by a sequencer, an N is put in place of the normal base
 * If a significant proportion of the calls at a position are N, it suggests that the pipeline was not able to make valid base calls
 
 #### Sequence Length Distribution
 
-<img src="./../images/quality-control/seq_len_dist.png" width="800">
+<img src="./intro_to_quality_control_img/seq_len_dist.png" width="800">
 
 * Graphs the distribution of fragment sizes in the sequence file analyzed
 * Some sequencers generate fragments with uniform lengths; even so, after trimming the uniformity will be broken and variation in length introduced
 
 #### Sequence Duplication Levels
 
-<img src="./../images/quality-control/seq_dup_lev.png" width="800">
+<img src="./intro_to_quality_control_img/seq_dup_lev.png" width="800">
 
 * When working with a diverse library, most sequences should occur only once in the final set
   * low levels of duplication can point to a high amount of coverage of the target sequence
@@ -235,14 +236,14 @@ For those of you may not have worked with FastQC reports before, I'll be going o
 
 #### Overrepresented Sequences
 
-<img src="./../images/quality-control/ovr_rep_seq.png" width="800">
+<img src="./intro_to_quality_control_img/ovr_rep_seq.png" width="800">
 
 * Lists all sequences making up more than 0.1% of the total, though to conserve memory, only ones that appear in the first 100,000 sequences are tracked
 * For every overrepresented sequence, FastQC looks for matches in a database of common contaminants, reporting best found hits
 
 ### Adapter Content
 
-<img src="./../images/quality-control/adp_con.png" width="800">
+<img src="./intro_to_quality_control_img/adp_con.png" width="800">
 
 * Picks up whether your library has a significant amount of adapter sequence, and whether you need to conduct trimming.
 * Plot shows cumulative percentage count of the proportion of the library that has seen each adapter sequence at each position
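
The cumulative count described in the Adapter Content bullets can be illustrated with a small sketch. This is not FastQC's actual implementation; the function name `cumulative_adapter_percent`, the toy reads, and the shortened adapter string are all assumptions made for illustration. Once a read has "seen" the adapter at some position, it stays counted at every later position, which is why the curve only ever rises.

```python
from typing import List


def cumulative_adapter_percent(reads: List[str], adapter: str) -> List[float]:
    """Percent of reads that have seen `adapter` at or before each position.

    Hypothetical helper for illustration only; FastQC screens for its own
    list of adapter sequences and is not implemented this way.
    """
    if not reads:
        return []
    max_len = max(len(read) for read in reads)
    # first_seen[i] = number of reads whose first adapter match starts at i
    first_seen = [0] * max_len
    for read in reads:
        index = read.find(adapter)
        if index != -1:
            first_seen[index] += 1
    percents = []
    running_total = 0
    for count in first_seen:
        running_total += count  # a read stays counted once it has matched
        percents.append(100.0 * running_total / len(reads))
    return percents


# Toy data: three of the four reads contain the (shortened) adapter.
toy_reads = ["ACGTAGATCGG", "AGATCGGTTTT", "ACGTACGTACG", "TTTTTTAGATC"]
toy_percents = cumulative_adapter_percent(toy_reads, "AGATC")
```

A curve that climbs steeply toward the 3' end of the reads is the pattern that suggests trimming is worthwhile.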

diff --git a/minota/images/quality-control/.DS_Store → ...is/intro_to_quality_control_img/.DS_Store b/minota/images/quality-control/.DS_Store → ...is/intro_to_quality_control_img/.DS_Store
diff --git a/minota/images/quality-control/adp_con.png → .../intro_to_quality_control_img/adp_con.png b/minota/images/quality-control/adp_con.png → .../intro_to_quality_control_img/adp_con.png
diff --git a/...ta/images/quality-control/basic_stats.png → ...ro_to_quality_control_img/basic_stats.png b/...ta/images/quality-control/basic_stats.png → ...ro_to_quality_control_img/basic_stats.png
diff --git a/...ta/images/quality-control/ovr_rep_seq.png → ...ro_to_quality_control_img/ovr_rep_seq.png b/...ta/images/quality-control/ovr_rep_seq.png → ...ro_to_quality_control_img/ovr_rep_seq.png
diff --git a/...images/quality-control/per_base_N_con.png → ...to_quality_control_img/per_base_N_con.png b/...images/quality-control/per_base_N_con.png → ...to_quality_control_img/per_base_N_con.png
diff --git a/...ages/quality-control/per_base_seq_con.png → ..._quality_control_img/per_base_seq_con.png b/...ages/quality-control/per_base_seq_con.png → ..._quality_control_img/per_base_seq_con.png
diff --git a/...ges/quality-control/per_base_seq_qual.png → ...quality_control_img/per_base_seq_qual.png b/...ges/quality-control/per_base_seq_qual.png → ...quality_control_img/per_base_seq_qual.png
diff --git a/...images/quality-control/per_seq_gc_con.png → ...to_quality_control_img/per_seq_gc_con.png b/...images/quality-control/per_seq_gc_con.png → ...to_quality_control_img/per_seq_gc_con.png
diff --git a/...es/quality-control/per_seq_qual_score.png → ...uality_control_img/per_seq_qual_score.png b/...es/quality-control/per_seq_qual_score.png → ...uality_control_img/per_seq_qual_score.png
diff --git a/...ges/quality-control/per_tile_seq_qual.png → ...quality_control_img/per_tile_seq_qual.png b/...ges/quality-control/per_tile_seq_qual.png → ...quality_control_img/per_tile_seq_qual.png
diff --git a/...ages/quality-control/qc_workflow_edit.png → ..._quality_control_img/qc_workflow_edit.png b/...ages/quality-control/qc_workflow_edit.png → ..._quality_control_img/qc_workflow_edit.png
diff --git a/...mages/quality-control/qc_workflow_run.png → ...o_quality_control_img/qc_workflow_run.png b/...mages/quality-control/qc_workflow_run.png → ...o_quality_control_img/qc_workflow_run.png
diff --git a/...ta/images/quality-control/seq_dup_lev.png → ...ro_to_quality_control_img/seq_dup_lev.png b/...ta/images/quality-control/seq_dup_lev.png → ...ro_to_quality_control_img/seq_dup_lev.png
diff --git a/...a/images/quality-control/seq_len_dist.png → ...o_to_quality_control_img/seq_len_dist.png b/...a/images/quality-control/seq_len_dist.png → ...o_to_quality_control_img/seq_len_dist.png
diff --git a/minota/images/quality-control/ssh_1.png → ...is/intro_to_quality_control_img/ssh_1.png b/minota/images/quality-control/ssh_1.png → ...is/intro_to_quality_control_img/ssh_1.png
diff --git a/bioinformatics/online_resources/intro_to_databases.md b/bioinformatics/online_resources/intro_to_databases.md
@@ -0,0 +1,32 @@
+---
+title: Introduction to Databases
+author: "Nathaniel Maki"
+organization: MDIBL Computational Core
+date: "January 20th"
+---
+
+# Introduction to Databases
+
+## Learning Objectives
+
+* Learn the differences between Primary and Secondary databases
+* Gain exposure to the wide range of databases available for exploration
+* Become familiar with standard use cases for a selection of sites covered
+
+## Summary
+
+Commonly, databases are characterized as either primary or secondary, and this holds true for bioinformatics as it does for other data-rich fields.
+
+**Primary databases** consist of data that has been experimentally derived, with the results uploaded directly into the database by researchers
+
+* In our domain for example, the information that is archived is made up of content such as nucleotide or protein sequence, or macromolecular structure
+* Once assigned an accession number, the data stored within a primary database becomes static, and is designated a Record
+  * Ex: GenBank, ENA, GEO
+
+**Secondary databases** could be considered an "extension" of primary ones, due to their makeup being derived from the analysis of primary data
+
+* Pull from multiple sources, such as other databases, and available scientific literature
+* These resources are incredibly complex, combining manual and computational analysis/interpretation, and are very highly curated
+* Their primary purpose is to exist as vast repositories of reference material, with detailed data ranging from single genes, to complete and published experimental results
+  * Ex: UniProt, Ensembl, InterPro
+
diff --git a/bioinformatics/online_resources/intro_to_ensembl.md b/bioinformatics/online_resources/intro_to_ensembl.md
@@ -0,0 +1,113 @@
+---
+title: Introduction to Ensembl
+author: "Nathaniel Maki"
+organization: MDIBL Computational Core
+date: "January 24th"
+---
+
+# Introduction to Ensembl
+
+## Summary
+
+* Ensembl is a genome browser, acting as a vast repository of reference genomes and annotations for a wide range of organisms, including Human, Mouse, C. elegans, and Zebrafish
+* Mostly dedicated to model organisms, but does contain resources for a number of non-model species
+* While Ensembl is primarily focused on vertebrates, Ensembl Genomes extends coverage to non-vertebrates, and includes Plants, Fungi, and Bacteria
+
+Ensembl annotates a large swath of data onto its genome assemblies; the first type is Gene Models (gene builds)
+
+## Gene Models
+
+Built from the following sources:
+
+International Nucleotide Sequence databases (ENA, GenBank, DDBJ)
+* cDNAs
+* ESTs
+* RNAseq
+
+NCBI RefSeq
+* Manually annotated proteins and mRNAs
+
+Protein Sequence databases
+* Swiss-Prot
+
+Sequences from the above resources are aligned to the genome, and transcripts are clustered from the alignments based on overlapping coding sequences
+
+These clusters form Ensembl genes (the automated genome annotation pipeline)
+
+Ensembl genomes can be annotated either automatically or manually (HAVANA for manual)
+* Set of genes is known as the Gencode geneset
+
+In addition to gene annotation, other data types are added to genome, including variation data, comparative genomics, and regulatory features (which we'll touch on later)
+
+## Querying
+
+Choose the human genome build, and search for `tp53`
+
+* links on the left of the page show specific information related to the TP53 gene
+
+### Summary
+
+* Gene has 27 transcripts annotated, 312 orthologues, 2 paralogues
+* `Show transcript table` gives us detailed information regarding the Gene and its associated transcripts
+  * Transcript ID
+  * Biotype
+  * CCDS (Consensus Coding DNA seq set)
+  * UniProt Match - Link to Protein transcript entry
+
+#### Gene Track
+
+* Blue bar = Contigs (sequence of overlapping reads)
+  * Transcripts above contig are on the forward strand, below it they're on the reverse
+  * Boxes are exons, lines which connect them are the introns
+    * Filled in boxes contain coding sequence, unfilled represent untranslated regions
+    * Red = Ensembl Protein coding (annotated by Ensembl automated)
+    * Gold = merged Ensembl/Havana (annotated by Ensembl automated + Havana manual annotation)
+    * Blue = Processed Transcript
+* Regulation
+  * Dark Salmon = Promoter
+  * Light Salmon = Promoter Flank
+  * Pink = Transcription Factor Binding Site
+  * Cyan = CTCF
+
+* Selecting a Transcript
+  * Click the box of your choosing and select the transcript ID
+  * Can examine supporting evidence
+  * Protein Information (reference UniProt)
+
+* Region in detail
+  * Select the Location tab
+  * Top of page is a chromosomal overview; the red box denotes the region of the chromosome that the other views on the page focus on
+  * Red box in Detail highlights a 1 Mb overview of TP53
+  * Scrolling further down shows most detailed location of TP53
+    * Tracks can be formatted, added, removed, etc
+    * Gear icon (configure) lets you add additional tracks
+
+### Comparative Genomics
+
+Allows you to compare a Gene against multiple alignments, Gene Trees, Orthologues, and Paralogues
+* Gene Ortholog - homologous genes in different species that diverged as new species arose, and that typically maintain a function similar to the precursor gene
+  * Originate from speciation event
+* Gene Paralog - homologous genes that diverged within a species, where the new gene takes on a new function
+  * Come into existence during gene duplication, where a copy of the gene obtains a mutation -> new gene with new function
+
+#### Alignments
+
+* Pairwise - meaning two sequences at a time
+* Multiple - more than two (attempt to align all sequences within a query set)
+
+Can choose many sequences to potentially align to
+* Examine full map to see areas of similar sequence
+* Also look at high quality assemblies compared to low quality
+
+#### Gene Tree
+* Relation of gene between species, includes homologs
+
+#### Orthologues
+* Lists gene orthologues
+
+#### Paralogues
+* Lists gene paralogues
+
+
+
+
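
The walkthrough above uses the Ensembl website. As a rough programmatic counterpart, the sketch below builds request URLs for the same TP53 lookup and for its orthologue and paralogue lists. It assumes the public Ensembl REST service at `rest.ensembl.org` and its `lookup/symbol` and `homology/symbol` endpoints, neither of which is mentioned in the original notes; the helper names are hypothetical, and the current REST documentation should be checked before relying on the parameters.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Assumption: the public Ensembl REST service, not part of the original notes.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol: str, species: str = "homo_sapiens") -> str:
    """Build the URL that looks up a gene by its symbol (hypothetical helper)."""
    return f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"


def homology_url(symbol: str, homology_type: str,
                 species: str = "homo_sapiens") -> str:
    """Build the URL listing a gene's orthologues or paralogues."""
    if homology_type not in ("orthologues", "paralogues"):
        raise ValueError("homology_type must be 'orthologues' or 'paralogues'")
    query = urlencode({"type": homology_type})
    return (f"{ENSEMBL_REST}/homology/symbol/"
            f"{quote(species)}/{quote(symbol)}?{query}")


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body (needs network access)."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


tp53_lookup = lookup_url("TP53")
tp53_paralogues = homology_url("TP53", "paralogues")
```

Calling `fetch_json(tp53_lookup)` would retrieve the gene record when a network connection is available; the URL builders themselves make no requests.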