Lab_notebook
Week 4: 9.10.2017, 10.10.2017, 11.10.2017, 12.10.2017, 13.10.2017
- Catch up on relevant literature relating to the clustering method used in this project
- Parse names of samples of ORF database and bind them to sample metadata
- Strange repeated samples were deleted from the GOS sampling campaign (JCVI_SMPL_1103283000058 was repeated 5x)
- print specific line in a text file
# prints the first line of a file
sed -n 1p filename
-
line 55 from TARA_OSD_GOS.orf.aa.fasta is blank
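A blank line at a known position can be confirmed with the same sed idiom, and stripped if needed. The toy stream below is illustrative; the real file is TARA_OSD_GOS.orf.aa.fasta.

```shell
# print only line 3 of a toy stream (empty output = blank line)
printf 'a\nb\n\nd\n' | sed -n 3p
# drop all blank lines
printf 'a\nb\n\nd\n' | sed '/^$/d'
# → a
# → b
# → d
```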
-
Python script to parse out headers of ORF files
# Start with this command
# python script.py file > output
# python script to parse out headers
import sys, re

fd = open(sys.argv[1], "r")
for line in fd:
    # Malaspina
    # line = line[7:13]
    # OSD
    # line = re.search(r'OSD(\d)+_', line).group(0)[:-1]
    # TARA
    # note: re.search returns None when the pattern does not match,
    # so a non-matching header raises an AttributeError here
    line = re.search(r'TARA_(\d)+_(\D)+_(.)+-(.)+_', line).group(0)[:-1]
    # GOS
    # line = re.search(r'GS(\d)+(\D)*-', line).group(0)[:-1]
    sys.stdout.write(line + '\n')
fd.close()
#####################################################################
- TARA may not have worked, re-run tomorrow
- Finish parsing out headers from ORF file
- learn more about database management (R data.table, SQLite)
- Read:
- ~/Desktop/msc/lit/steinegger2016.pdf
- ~/Desktop/msc/lit/hauser2016.pdf
- ~/Desktop/msc/lit/uniclust.pdf
- 20 min of python code academy
- steinegger2016.pdf:
- MMseqs2 can potentially increase the annotated fraction in metagenomic analyses
- central step to metagenomic analysis is the annotation of ORFs by searching for similar protein sequences in databases
- BLAST uses a seed-and-extend approach: k-mers are matched and then extended into a full gapped alignment
- a large fraction of species found in metagenomic data may be distantly related to a well-known annotated genome, thus a large proportion of metagenomic sequences may be left unannotated
- MMseqs2 is for querying protein sequences... MMseqs (hauser2016.pdf) has the clustering function!
- hauser2016.pdf:
- MMseqs -> deep clustering of large datasets
- three modules
- pre-filtering -> k-mer-based similarity score between all "query sequences" and all "target sequences"
- alignment -> Smith-Waterman alignment
- clustering ->
- databases such as UniProt are becoming large and have redundant sequences -> a solution to this is to calculate a representative subset of sequences by clustering them by their similarity
- uniclust.pdf:
- due to the rapidly increasing number of protein sequences in databases such as UniProt, it is more efficient to work with representative subsets of sequences rather than the full set, lowering computational time
- cluster evaluation: functional homogeneity of clusters assessed by comparing Gene Ontology terms
- Redoing (Parse names of samples of ORF database and bind them to sample metadata) with BASH commands
# parsing heads from GOS
sed -e 's/-/\t/' GS.headers | cut -f1 | sort -u --parallel 32 -S20% > GS_parsed.headers
# parsing headers from Malaspina
cut -c 8- ICM.headers | sed 's/_/\t/g' | cut -f1 | sort -u --parallel 32 -S20% > ICM_parsed.headers
# cut -> keep from the 8th character onward (drops the first 7 characters)
# sed -> replace first "_" with "\t"
# cut -> cut out field 1
# sort -u -> keep one copy of each header; --parallel 32 -> sort with 32 threads; -S20% -> use 20% of memory
# parsing headers from OSD
sed 's/_spades/\t/g' OSD.headers | cut -f1 | sort -u --parallel 32 -S20% > OSD_parsed.headers
# sed -> replace "_spades" with "\t"
# cut -> cut out field 1
# sort -u -> keep one copy of each header; --parallel 32 -> sort with 32 threads; -S20% -> use 20% of memory
# parsing headers from TARA
cut -f1-4 -d '_' TARA.headers | sort -u --parallel 32 -S20% > TARA_parsed.headers
# cut -> cut out fields 1-4 with the file delimited by '_' (essentially cut at the 4th '_')
# sort -u -> keep one copy of each header; --parallel 32 -> sort with 32 threads; -S20% -> use 20% of memory
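As a quick sanity check, the TARA cut can be tried on a single made-up header (the ID below is illustrative, not a real TARA ORF header; real input is TARA.headers):

```shell
# cut keeps the first four '_'-delimited fields, i.e. everything up to the 4th '_'
echo 'TARA_100_DCM_0.22-3_contig1_42' | cut -f1-4 -d '_'
# → TARA_100_DCM_0.22-3
```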
-
There may be issues with the Malaspina data headers from the ORF file... some headers differ from the metadata file (e.g. usually MP(\d){4} but the ORF file has MP427.0)
-
In the TARA data, there are multiple samples with the same sample name. It appears that many samples were taken at the same site in the beginning of the expedition to test methods.
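Repeated sample names like these can be listed with a sort | uniq -c pass. The sample names below are toy stand-ins; real input would be a parsed header file.

```shell
# count occurrences of each sample name and keep only the duplicates
printf 'TARA_001\nTARA_001\nTARA_002\nTARA_001\n' |
  sort | uniq -c | awk '$1 > 1 {print $2, $1"x"}'
# → TARA_001 3x
```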
-
worked on data wrangling the contextual data
- Finish cleaning up the ORF headers and contextual data
- Map samples to world map with ggplot
- Make a box plot for environmental parameters and explore the data
- continue reading literature referenced in Chiara's wiki
- Start to explore the protein clusters and see how they react to different environmental parameters
- 20 min of python academy
- [] read:
- ~/Desktop/msc/lit/pfam.pdf
- pfam.pdf:
- protein families are built on reference proteomes
- NEW OFFICE!
- colored sample map based on salinity and temperature
- the low-salinity areas tend to be near the coasts, possibly due to freshwater input
- temperature correlates with latitude
- worked on folder structure for clustering data
- finish reading pfam.pdf
- finish making graphic for file structure wiki
- start working with Chiara on hmmer script
- [] 20 min of python academy
- [] read:
- ~/Desktop/msc/lit/pfam.pdf
- ~/Desktop/msc/lit/CulturingBacteria.pdf
- [] finish graphic for file structure and write notes about it on wiki
- [] hmmer script with Chiara
- script for removing rejected sequences from clusters
#!/bin/bash
# remove rejected sequences from cluster files
# print each command as it executes
set -x
set -e
# set path variables
EU=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/EU_clusters.tsv
rejected=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_eval_stats/eval_results/msa_based_eval/rejected.tsv
# cut out field 2 (sequence headers) from the rejected sequences file
cut -f2 $rejected > temp1
# inverse grep: drop environmental unknown cluster lines that match any rejected sequence header
grep -v -f temp1 $EU > /scratch/mschecht/Environmental_unknowns_norejects.tsv
rm temp1
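A minimal sketch of what the grep -v -f step does, on toy files. Adding -F -w (fixed strings, whole words) may be safer for ID lists, since plain -f treats each pattern as a regex and also matches substrings; the filenames below are illustrative.

```shell
# toy stand-ins for rejected.tsv (IDs) and the EU cluster table
printf 'seq2\nseq4\n' > rejects.txt
printf 'cl1 seq1\ncl1 seq2\ncl2 seq3\ncl2 seq4\n' > clusters.txt
# -v invert match, -f patterns from file, -F fixed strings, -w whole words
grep -v -F -w -f rejects.txt clusters.txt
# → cl1 seq1
# → cl2 seq3
rm rejects.txt clusters.txt
```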
- another version of the script using join
#!/bin/bash
# rejected2.sh
# script using join to remove the rejected sequences from environmental unknown clusters
# print each command as it executes
set -x
set -e
# set path variables
EU=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/EU_clusters.tsv
GU=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/GU_clusters.tsv
Knowns=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/K_clusters.tsv
Discarded=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/Discarded_clusters.tsv
rejected=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_eval_stats/eval_results/msa_based_eval/rejected.tsv
# join
# -1 2: file 1 field 2
# -2 4: file 2 field 4
# <(sort -k2,2 $rejected): process substitution, sort by field 2!
# <(sort -k4,4 $EU): process substitution, sort field 4!
join -1 2 -2 4 -v2 <(sort -k2,2 $rejected) <(sort -k4,4 $EU) > EU_clusters_no_rejects.tsv
# genomic unknowns only have 3 columns (cluster#, representative, sequence header), so join on field 3
join -1 2 -2 3 -v2 <(sort -k2,2 $rejected) <(sort -k3,3 $GU) > GU_clusters_no_rejects.tsv
join -1 2 -2 4 -v2 <(sort -k2,2 $rejected) <(sort -k4,4 $Knowns) > K_clusters_no_rejects.tsv
join -1 2 -2 4 -v2 <(sort -k2,2 $rejected) <(sort -k4,4 $Discarded) > Discarded_clusters_no_rejects.tsv
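The join -v2 idiom can be checked on toy files. The real script feeds sorted input via process substitution; temp files are used here to keep the sketch portable, and the file contents are made up.

```shell
# rejects: join field is column 1; clusters: join field is column 2
printf 'seq2\nseq4\n' | sort > rej.txt
printf 'cl1 seq1\ncl1 seq2\ncl2 seq3\n' | sort -k2,2 > clu.txt
# -v2 prints only the lines of file 2 that have NO match in file 1;
# join prints the join field first, then the remaining fields
join -1 1 -2 2 -v2 rej.txt clu.txt
# → seq1 cl1
# → seq3 cl2
rm rej.txt clu.txt
```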
- write more in the file structure wiki
- continue working on the hmmer script with Chiara; need to test the script on a larger number of FASTA files
- hmmcalibrate -> either -Eml or -O ... check this
- work on lab rotation poster!!!
- 25 min of python academy
- [] finish lab rotation posters
- [] send nexus email
- [] write more for file structure wiki
- read:
- ~/Desktop/msc/lit/pfam.pdf
- ~/Desktop/msc/lit/CulturingBacteria.pdf
- [] finish graphic for file structure and write notes about it on wiki
- hmmer script with Chiara
- pfam.pdf:
- protein families are built on reference proteomes
- related Pfam entries are grouped into sets called "clans"
- for each Pfam entry, a representative subset of the full multiple sequence alignment (the seed alignment) is made, which is then used to generate an HMM profile
- CulturingBacteria.pdf:
- culturing has been highly biased to a few phylogenetic groups
- ≥50% of ORFs from annotated genomes have no functional annotation
- ≥80% of ORFs cannot be assigned to a metabolic pathway
- [] finish lab rotation posters
- send nexus email
- write more for file structure wiki
- [] finish graphic for file structure and write notes about it on wiki
- short description of masters thesis: Ecological implications of the metagenomic functional unknown fraction in the marine environment
#29.9.2017
- finish lab rotation posters
- [] finish graphic for file structure and write notes about it on wiki
- monitor hmmbuild/emit
- regain access to bigmem and UV2000
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] explore singletons
- [] explore distribution of EU and GU in different sites
mschecht@bigmem-2:/scratch/mschecht/hmm_check.sh
find EU/ -name 'test_output' -exec cp {} hmm_EU_all/ \;
# cp <(find ...) copies the contents of the process-substitution pipe, not the files it lists,
# which is likely why not all files were retrieved; -exec copies each match itself
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] explore singletons
- [] explore distribution of EU and GU in different sites
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] explore singletons
- explore distribution of EU and GU in different sites
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] explore singletons
- [] read:
- [] /Users/mattschechter/Desktop/msc/lit/sinha2017.pdf
- meet with Antonio next week to discuss timeline for masters thesis
- continued working on hmm profile test
- Internal seminar: habitat group, molecular ecology of hydrothermal vents
- look into hmmsearch: can one use a file with multiple concatenated hmm profiles?
- figure out how to convert data to proportions in R
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] read:
- [] /Users/mattschechter/Desktop/msc/lit/sinha2017.pdf
- explore abundances of EU
- always start off with the whole cluster file and then parse within R; this will reduce the number of files and keep things organized
- explore the genomic unknowns on all the genomes that Chiara downloaded from NCBI (redundant ones removed), then see the distribution of the clusters for each genus
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] read:
- /Users/mattschechter/Desktop/msc/lit/sinha2017.pdf
- [] /Users/mattschechter/Desktop/msc/lit/Venter2006.pdf
- [] /Users/mattschechter/Desktop/msc/lit/Pesant2015.pdf
- sinha2017.pdf:
- reproducibility of human microbiome studies requires meta-analysis of reproducible data
- DNA extraction, amplification protocols, and bioinformatics can result in variation similar in size to the biological differences themselves
- It is hard to set a single protocol due to variation of resources and biological questions
- made random set of 1M ORFs
- made random set of 10,000 EU hmm profiles
- hmm test is running on bigmem-1
- EUPC sample abundances mapping is done
- attempting to run NMDS on the EUPC fraction... issues with TARA temperature data
- [] read:
- [] /Users/mattschechter/Desktop/msc/lit/Venter2006.pdf
- [] /Users/mattschechter/Desktop/msc/lit/Pesant2015.pdf
- test the hmm profiles on random set of sequences
- [] see how well they return the clusters
- [] cluster Environmental Unknown Protein Clusters (EUPC) based on environmental parameters (NMDS with temperature).
# run from /scratch/mschecht/hmm_eval/GUPC_hmm_clstr_eval_results
/bioinf/software/hmmer/hmmer-3.1b2/bin/hmmsearch -o hmm_results_o --domtblout hmm_hit_table --cpu 16 hmm.file sequence.file
hmmer search of GUs against the random ORF file
/bioinf/software/hmmer/hmmer-3.1b2/bin/hmmsearch -o hmm_results_o --domtblout hmm_hit_table --cpu 16 GU_hmm_10k_random.file ../TARA_OSD_GOS_malaspina_random_1M.fasta
parse output of hmmsearch test on EUPC
../hmmsearch_output_parser.sh hmm_hit_table 1e-5 0.4
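hmmsearch_output_parser.sh is a local script, but the filtering it implies (e-value and coverage thresholds, here 1e-5 and 0.4) can be sketched with awk on a hypothetical 4-column table (query, target, e-value, coverage). The real --domtblout layout has many more columns, so the field numbers below are illustrative only.

```shell
# keep hits with e-value <= 1e-5 and coverage >= 0.4 (hypothetical columns);
# adding 0 forces numeric comparison of the threshold strings
printf 'q1 t1 1e-10 0.8\nq1 t2 1e-3 0.9\nq2 t3 1e-20 0.2\n' |
  awk -v emax=1e-5 -v covmin=0.4 '$3 + 0 <= emax + 0 && $4 + 0 >= covmin + 0'
# → q1 t1 1e-10 0.8
```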
- make abundance chart with raw counts
- move all data into SQLite
- write script to filter hmm results for lower e-value and highest coverage
- make presentation for Friday to summarize current results
- [] read:
- [] /Users/mattschechter/Desktop/msc/lit/Venter2006.pdf
- [] /Users/mattschechter/Desktop/msc/lit/Pesant2015.pdf
- [] write script to filter hmm results for lowest e-value and highest coverage
- [] Examine results on hmm profile test for EUPC clusters
- Move all data into SQLite format
- make presentation for Friday to summarize current results
- longitude and latitude in OSD have start and stop, which one should we use for coordinates?
- GOS -> sample MOVE858 has the same metadata except for # of reads, what should we do?
- OSD has no units for O2
- TARA and Malaspina do not have dates
- [] read:
- [] ~/Desktop/msc/lit/Venter2006.pdf
- [] ~/Desktop/msc/lit/Pesant2015.pdf
- [] ~/Desktop/msc/lit/Jeanbille2016.pdf
- [] write script to filter hmm results for lowest e-value and highest coverage
- [] Examine results on hmm profile test for EUPC clusters
- [] move new clusters from discarded to GU
- [] write script to filter hmm results for lowest e-value and highest coverage
- [] Examine results on hmm profile test for EUPC clusters
- move new clusters from discarded to GU
- finish cleaning contextual data and move it into SQL
- [] write script to filter hmm results for lowest e-value and highest coverage
- [] Examine results on hmm profile test for EUPC clusters
- meet with Antonio to discuss presentation, HMM eval test, and choosing generalist vs specialists, ask about ocean provinces,
- finish hmm-eval pipeline
wc -l all_no_rej.tsv
105697560 all_no_rej.tsv
grep -c 'eupc' all_no_rej.tsv
5751200
grep -c '^>' refined_ORFs.fa
99946360
# 5,751,200 more fasta headers in the all_no_rej.tsv than are extracted in the refined_ORFs.fa
Try to shuf the all_no_rej.tsv before running filterbyname.sh
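shuf only permutes line order, so sorting both versions and comparing confirms nothing is lost by the shuffle. The toy file below stands in for all_no_rej.tsv.

```shell
printf '3\n1\n2\n' > toy.tsv
shuf toy.tsv | sort > toy.a
sort toy.tsv > toy.b
# identical sorted output means the shuffle kept the same line set
cmp toy.a toy.b && echo 'same line set'
rm toy.tsv toy.a toy.b
# → same line set
```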
Add longhurst to contextual data: Longhurst -> ocean provinces
Add filter size to contextual data
[x] index the EU and GU hmms -> running and working -> let Antonio know
[x] added ffindex step to hmm_build
- finish hmm-eval pipeline
- classify coordinates by ocean provinces
- add clusters_no_rejects to SQL
- [] finish hmm-eval pipeline
- [] classify coordinates by ocean provinces
- [] add cluster_no_rejects to SQL
- [] make upset plot of spread of EUPCs across various projects
- use this path to start RStudio on bigmem (solves issues with plotting)
LD_LIBRARY_PATH=~/.linuxbrew/Cellar/libpng/1.6.32/lib:~/.linuxbrew/Cellar/zlib/1.2.11/lib rstudio