Lab_notebook
Week 4: 9.10.2017, 10.10.2017, 11.10.2017, 12.10.2017, 13.10.2017
- Catch up on relevant literature relating to the clustering method used in this project
- Parse names of samples of ORF database and bind them to sample metadata
- Strange repeated samples were deleted from the GOS sampling campaign (JCVI_SMPL_1103283000058 was repeated 5x)
- print specific line in a text file
# prints the first line of a file
sed -n 1p filename
-
line 55 from TARA_OSD_GOS.orf.aa.fasta is blank
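A blank line at a known position can be confirmed with the same sed idiom, and stripped if needed. The toy stream below is illustrative; the real file is TARA_OSD_GOS.orf.aa.fasta.

```shell
# print only line 3 of a toy stream (empty output = blank line)
printf 'a\nb\n\nd\n' | sed -n 3p
# drop all blank lines
printf 'a\nb\n\nd\n' | sed '/^$/d'
# → a
# → b
# → d
```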
-
Python script to parse out headers of ORF files
# Start with this command
# python script.py file > output
# python script to parse out headers
import sys, re

fd = open(sys.argv[1], "r")
for line in fd:
    # Malaspina
    # line = line[7:13]
    # OSD
    # line = re.search(r'OSD(\d)+_', line).group(0)[:-1]
    # TARA
    # note: re.search returns None when the pattern does not match,
    # so a non-matching header raises an AttributeError here
    line = re.search(r'TARA_(\d)+_(\D)+_(.)+-(.)+_', line).group(0)[:-1]
    # GOS
    # line = re.search(r'GS(\d)+(\D)*-', line).group(0)[:-1]
    sys.stdout.write(line + '\n')
fd.close()
#####################################################################
- TARA may not have worked, re-run tomorrow
- Finish parsing out headers from ORF file
- learn more about database management (R data.table, SQLite)
- Read:
- ~/Desktop/msc/lit/steinegger2016.pdf
- ~/Desktop/msc/lit/hauser2016.pdf
- ~/Desktop/msc/lit/uniclust.pdf
- 20 min of python code academy
- steinegger2016.pdf:
- MMseqs2 can potentially increase the annotated fraction in metagenomic analyses
- central step to metagenomic analysis is the annotation of ORFs by searching for similar protein sequences in databases
- BLAST uses a seed-and-extend approach: k-mers are matched and then extended into a full gapped alignment
- a large fraction of species found in metagenomic data may be distantly related to a well-known annotated genome, thus a large proportion of metagenomic sequences may be left unannotated
- MMseqs2 is for querying protein sequences... MMseqs (hauser2016.pdf) has the clustering function!
- hauser2016.pdf:
- MMseqs -> deep clustering of large datasets
- three modules
- pre-filtering -> k-mer-based similarity score between all "query sequences" and all "target sequences"
- alignment -> Smith-Waterman alignment
- clustering ->
- databases such as UniProt are becoming large and have redundant sequences -> a solution to this is to calculate a representative subset of sequences by clustering them by their similarity
- uniclust.pdf:
- due to the rapidly increasing number of protein sequences in databases such as UniProt, it is more efficient to work with representative subsets of sequences rather than the full set, lowering computational time
- cluster evaluation: functional homogeneity of clusters assessed by comparing Gene Ontology terms
- Redoing (Parse names of samples of ORF database and bind them to sample metadata) with BASH commands
# parsing heads from GOS
sed -e 's/-/\t/' GS.headers | cut -f1 | sort -u --parallel 32 -S20% > GS_parsed.headers
# parsing headers from Malaspina
cut -c 8- ICM.headers | sed 's/_/\t/g' | cut -f1 | sort -u --parallel 32 -S20% > ICM_parsed.headers
# cut -> keep from the 8th character onward (drops the first 7 characters)
# sed -> replace first "_" with "\t"
# cut -> cut out field 1
# sort -u -> keep one copy of each header; --parallel 32 -> sort with 32 threads; -S20% -> use 20% of memory
# parsing headers from OSD
sed 's/_spades/\t/g' OSD.headers | cut -f1 | sort -u --parallel 32 -S20% > OSD_parsed.headers
# sed -> replace "_spades" with "\t"
# cut -> cut out field 1
# sort -u -> keep one copy of each header; --parallel 32 -> sort with 32 threads; -S20% -> use 20% of memory
# parsing headers from TARA
cut -f1-4 -d '_' TARA.headers | sort -u --parallel 32 -S20% > TARA_parsed.headers
# cut -> cut out fields 1-4 with the file delimited by '_' (essentially cut at the 4th '_')
# sort -u -> keep one copy of each header; --parallel 32 -> sort with 32 threads; -S20% -> use 20% of memory
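As a quick sanity check, the TARA cut can be tried on a single made-up header (the ID below is illustrative, not a real TARA ORF header; real input is TARA.headers):

```shell
# cut keeps the first four '_'-delimited fields, i.e. everything up to the 4th '_'
echo 'TARA_100_DCM_0.22-3_contig1_42' | cut -f1-4 -d '_'
# → TARA_100_DCM_0.22-3
```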
-
There may be issues with the Malaspina data headers from the ORF file... some headers differ from the metadata file (e.g. usually MP(\d){4} but the ORF file has MP427.0)
-
In the TARA data, there are multiple samples with the same sample name. It appears that many samples were taken at the same site in the beginning of the expedition to test methods.
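Repeated sample names like these can be listed with a sort | uniq -c pass. The sample names below are toy stand-ins; real input would be a parsed header file.

```shell
# count occurrences of each sample name and keep only the duplicates
printf 'TARA_001\nTARA_001\nTARA_002\nTARA_001\n' |
  sort | uniq -c | awk '$1 > 1 {print $2, $1"x"}'
# → TARA_001 3x
```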
-
worked on data wrangling the contextual data
- Finish cleaning up the ORF headers and contextual data
- Map samples to world map with ggplot
- Make a box plot for environmental parameters and explore the data
- continue reading literature referenced in Chiara's wiki
- Start to explore the protein clusters and see how they react to different environmental parameters
- 20 min of python academy
- [] read:
- ~/Desktop/msc/lit/pfam.pdf
- pfam.pdf:
- protein families are built on reference proteomes
- NEW OFFICE!
- colored sample map based on salinity and temperature
- the low-salinity areas tend to be near the coasts, possibly due to freshwater input
- temperature correlates with latitude
- worked on folder structure for clustering data
- finish reading pfam.pdf
- finish making graphic for file structure wiki
- start working with Chiara on hmmer script
- [] 20 min of python academy
- [] read:
- ~/Desktop/msc/lit/pfam.pdf
- ~/Desktop/msc/lit/CulturingBacteria.pdf
- [] finish graphic for file structure and write notes about it on wiki
- [] hmmer script with Chiara
- script for removing rejected sequences from clusters
#!/bin/bash
# remove rejected sequences from cluster files
# print each command as it executes
set -x
set -e
# set path variables
EU=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/EU_clusters.tsv
rejected=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_eval_stats/eval_results/msa_based_eval/rejected.tsv
# cut out field 2 (sequence headers) from the rejected sequences file
cut -f2 $rejected > temp1
# inverse grep: drop environmental unknown cluster lines that match any rejected sequence header
grep -v -f temp1 $EU > /scratch/mschecht/Environmental_unknowns_norejects.tsv
rm temp1
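A minimal sketch of what the grep -v -f step does, on toy files. Adding -F -w (fixed strings, whole words) may be safer for ID lists, since plain -f treats each pattern as a regex and also matches substrings; the filenames below are illustrative.

```shell
# toy stand-ins for rejected.tsv (IDs) and the EU cluster table
printf 'seq2\nseq4\n' > rejects.txt
printf 'cl1 seq1\ncl1 seq2\ncl2 seq3\ncl2 seq4\n' > clusters.txt
# -v invert match, -f patterns from file, -F fixed strings, -w whole words
grep -v -F -w -f rejects.txt clusters.txt
# → cl1 seq1
# → cl2 seq3
rm rejects.txt clusters.txt
```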
- another version of the script using join
#!/bin/bash
# rejected2.sh
# script using join to remove the rejected sequences from environmental unknown clusters
# print each command as it executes
set -x
set -e
# set path variables
EU=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/EU_clusters.tsv
GU=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/GU_clusters.tsv
Knowns=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/K_clusters.tsv
Discarded=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_categories/Discarded_clusters.tsv
rejected=/bioinf/projects/megx/UNKNOWNS/2017_09/cluster_eval_stats/eval_results/msa_based_eval/rejected.tsv
# join
# -1 2: file 1 field 2
# -2 4: file 2 field 4
# <(sort -k2,2 $rejected): process substitution, sort by field 2!
# <(sort -k4,4 $EU): process substitution, sort field 4!
join -1 2 -2 4 -v2 <(sort -k2,2 $rejected) <(sort -k4,4 $EU) > EU_clusters_no_rejects.tsv
# genomic unknowns only have 3 columns (cluster#, representative, sequence header), so join on field 3
join -1 2 -2 3 -v2 <(sort -k2,2 $rejected) <(sort -k3,3 $GU) > GU_clusters_no_rejects.tsv
join -1 2 -2 4 -v2 <(sort -k2,2 $rejected) <(sort -k4,4 $Knowns) > K_clusters_no_rejects.tsv
join -1 2 -2 4 -v2 <(sort -k2,2 $rejected) <(sort -k4,4 $Discarded) > Discarded_clusters_no_rejects.tsv
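The join -v2 idiom can be checked on toy files. The real script feeds sorted input via process substitution; temp files are used here to keep the sketch portable, and the file contents are made up.

```shell
# rejects: join field is column 1; clusters: join field is column 2
printf 'seq2\nseq4\n' | sort > rej.txt
printf 'cl1 seq1\ncl1 seq2\ncl2 seq3\n' | sort -k2,2 > clu.txt
# -v2 prints only the lines of file 2 that have NO match in file 1;
# join prints the join field first, then the remaining fields
join -1 1 -2 2 -v2 rej.txt clu.txt
# → seq1 cl1
# → seq3 cl2
rm rej.txt clu.txt
```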
- write more in the file structure wiki
- continue working on the hmmer script with Chiara; need to test the script on a larger number of FASTA files
- hmmcalibrate -> either -Eml or -O ... check this
- work on lab rotation poster!!!
- 25 min of python academy
- [] finish lab rotation posters
- [] send nexus email
- [] write more for file structure wiki
- read:
- ~/Desktop/msc/lit/pfam.pdf
- ~/Desktop/msc/lit/CulturingBacteria.pdf
- [] finish graphic for file structure and write notes about it on wiki
- hmmer script with Chiara
- pfam.pdf:
- protein families are built on reference proteomes
- related Pfam entries are grouped into sets called "clans"
- for each Pfam entry, a representative subset of the full multiple sequence alignment (the seed alignment) is made, which is then used to generate an HMM profile
- CulturingBacteria.pdf:
- culturing has been highly biased to a few phylogenetic groups
- ≥50% of ORFs from annotated genomes have no functional annotation
- ≥80% of ORFs cannot be assigned to a metabolic pathway
- [] finish lab rotation posters
- send nexus email
- write more for file structure wiki
- [] finish graphic for file structure and write notes about it on wiki
- short description of masters thesis: Ecological implications of the metagenomic functional unknown fraction in the marine environment
#29.9.2017
- finish lab rotation posters
- [] finish graphic for file structure and write notes about it on wiki
- monitor hmmbuild/emit
- regain access to bigmem and UV2000
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] explore singletons
- [] explore distribution of EU and GU in different sites
mschecht@bigmem-2:/scratch/mschecht/hmm_check.sh
find EU/ -name 'test_output' -exec cp {} hmm_EU_all/ \;
# cp <(find ...) copies the contents of the process-substitution pipe, not the files it lists,
# which is likely why not all files were retrieved; -exec copies each match itself
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] explore singletons
- [] explore distribution of EU and GU in different sites
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] explore singletons
- explore distribution of EU and GU in different sites
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] explore singletons
- [] read:
- [] /Users/mattschechter/Desktop/msc/lit/sinha2017.pdf
- meet with Antonio next week to discuss timeline for masters thesis
- continued working on hmm profile test
- Internal seminar: habitat group, molecular ecology of hydrothermal vents
- look into hmmsearch: can one use a file with multiple concatenated hmm profiles?
- figure out how to convert data to proportions in R
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] read:
- [] /Users/mattschechter/Desktop/msc/lit/sinha2017.pdf
- explore abundances of EU
- always start off with the whole cluster file and then parse within R; this will reduce the number of files and keep things organized
- explore the genomic unknowns on all the genomes that Chiara downloaded from NCBI (redundant ones removed), then see the distribution of the clusters for each genus
- [] test the hmm profiles on random set of sequences, see how well they return the clusters
- create mixed dataset with 1M random clusters from all categories
- test 10k hmm profiles with hmmsearch on the dataset
- [] create a sequence similarity network between all KNOWN, GU and EU
- [] read:
- /Users/mattschechter/Desktop/msc/lit/sinha2017.pdf
- [] /Users/mattschechter/Desktop/msc/lit/Venter2006.pdf
- [] /Users/mattschechter/Desktop/msc/lit/Pesant2015.pdf
- sinha2017.pdf:
- reproducibility of human microbiome studies requires meta-analysis of reproducible data
- DNA extraction, amplification protocols, and bioinformatics can result in variation similar in size to the biological differences themselves
- It is hard to set a single protocol due to variation of resources and biological questions
- made random set of 1M ORFs
- made random set of 10,000 EU hmm profiles
- hmm test is running on bigmem-1
- EUPC sample abundances mapping is done
- attempting to run NMDS on the EUPC fraction... issues with TARA temperature data
- [] read:
- [] /Users/mattschechter/Desktop/msc/lit/Venter2006.pdf
- [] /Users/mattschechter/Desktop/msc/lit/Pesant2015.pdf
- test the hmm profiles on random set of sequences
- [] see how well they return the clusters
- [] cluster Environmental Unknown Protein Clusters (EUPC) based on environmental parameters (NMDS with temperature).
# run from /scratch/mschecht/hmm_eval/GUPC_hmm_clstr_eval_results
/bioinf/software/hmmer/hmmer-3.1b2/bin/hmmsearch -o hmm_results_o --domtblout hmm_hit_table --cpu 16 hmm.file sequence.file
hmmer search of GUs against the random ORF file
/bioinf/software/hmmer/hmmer-3.1b2/bin/hmmsearch -o hmm_results_o --domtblout hmm_hit_table --cpu 16 GU_hmm_10k_random.file ../TARA_OSD_GOS_malaspina_random_1M.fasta
parse output of hmmsearch test on EUPC
../hmmsearch_output_parser.sh hmm_hit_table 1e-5 0.4
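hmmsearch_output_parser.sh is a local script, but the filtering it implies (e-value and coverage thresholds, here 1e-5 and 0.4) can be sketched with awk on a hypothetical 4-column table (query, target, e-value, coverage). The real --domtblout layout has many more columns, so the field numbers below are illustrative only.

```shell
# keep hits with e-value <= 1e-5 and coverage >= 0.4 (hypothetical columns);
# adding 0 forces numeric comparison of the threshold strings
printf 'q1 t1 1e-10 0.8\nq1 t2 1e-3 0.9\nq2 t3 1e-20 0.2\n' |
  awk -v emax=1e-5 -v covmin=0.4 '$3 + 0 <= emax + 0 && $4 + 0 >= covmin + 0'
# → q1 t1 1e-10 0.8
```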
- make abundance chart with raw counts
- move all data into SQLite
- write script to filter hmm results for lower e-value and highest coverage
- make presentation for Friday to summarize current results
- [] read:
- [] /Users/mattschechter/Desktop/msc/lit/Venter2006.pdf
- [] /Users/mattschechter/Desktop/msc/lit/Pesant2015.pdf
- [] write script to filter hmm results for lowest e-value and highest coverage
- [] Examine results on hmm profile test for EUPC clusters
- Move all data into SQLite format
- make presentation for Friday to summarize current results
- longitude and latitude in OSD have start and stop, which one should we use for coordinates?
- GOS -> sample MOVE858 has the same metadata except for # of reads, what should we do?
- OSD has no units for O2
- TARA and Malaspina do not have dates
- [] read:
- [] ~/Desktop/msc/lit/Venter2006.pdf
- [] ~/Desktop/msc/lit/Pesant2015.pdf
- [] ~/Desktop/msc/lit/Jeanbille2016.pdf
- [] write script to filter hmm results for lowest e-value and highest coverage
- [] Examine results on hmm profile test for EUPC clusters
- [] move new clusters from discarded to GU
- [] write script to filter hmm results for lowest e-value and highest coverage
- [] Examine results on hmm profile test for EUPC clusters
- move new clusters from discarded to GU
- finish cleaning contextual data and move it into SQL
- [] write script to filter hmm results for lowest e-value and highest coverage
- [] Examine results on hmm profile test for EUPC clusters
- meet with Antonio to discuss presentation, HMM eval test, and choosing generalist vs specialists, ask about ocean provinces,
- finish hmm-eval pipeline
wc -l all_no_rej.tsv
105697560 all_no_rej.tsv
grep -c 'eupc' all_no_rej.tsv
5751200
grep -c '^>' refined_ORFs.fa
99946360
# 5,751,200 more fasta headers in the all_no_rej.tsv than are extracted in the refined_ORFs.fa
Try to shuf the all_no_rej.tsv before running filterbyname.sh
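shuf only permutes line order, so sorting both versions and comparing confirms nothing is lost by the shuffle. The toy file below stands in for all_no_rej.tsv.

```shell
printf '3\n1\n2\n' > toy.tsv
shuf toy.tsv | sort > toy.a
sort toy.tsv > toy.b
# identical sorted output means the shuffle kept the same line set
cmp toy.a toy.b && echo 'same line set'
rm toy.tsv toy.a toy.b
# → same line set
```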
Add longhurst to contextual data: Longhurst -> ocean provinces
Add filter size to contextual data
[x] index the EU and GU hmms -> running and working -> let Antonio know
[x] added ffindex step to hmm_build
- finish hmm-eval pipeline
- classify coordinates by ocean provinces
- add clusters_no_rejects to SQL
- [] finish hmm-eval pipeline
- [] classify coordinates by ocean provinces
- [] add cluster_no_rejects to SQL
- [] make upset plot of spread of EUPCs across various projects
- use this path to start RStudio on bigmem (solves issues with plotting)
LD_LIBRARY_PATH=~/.linuxbrew/Cellar/libpng/1.6.32/lib:~/.linuxbrew/Cellar/zlib/1.2.11/lib rstudio