Metatranscriptomes-based sequence similarity networks uncover genetic signatures within parasitic freshwater microbial eukaryotes
This workflow reproduces the analyses conducted in Monjot et al. (2025).
First, clone the GitHub repository: git clone https://github.com/amonjot/SSN_Monjot_2024.git
Second, move into the repository directory: cd SSN_Monjot_2024
*******************************
Directory organization
*******************************
SSN_Analysis_Monjot_et_al._2025
|-> rawdata (sub-directory for trophic modes table and metadata)
|-> ko_to_hierarchy.txt (KO id definition table from https://www.genome.jp/kegg-bin/show_brite?ko00001.keg)
|-> Table_S1.tsv (in-house trophic mode database; Monjot et al., 2025)
|-> annotation_ko.tsv (The output of the KO annotation step using koFamScan) [https://doi.org/10.5281/zenodo.11972386]
|-> proteins_from_Unigenes_CEQ.fa (The proteins fasta file provided by the proteins prediction step after filtration) [https://doi.org/10.5281/zenodo.11972386]
|-> table_taxonomy.perUnigene.allUnigenes.tsv (The output of the taxonomic affiliation against MetaEuk) [https://doi.org/10.5281/zenodo.11972386]
|-> script (sub-directory for scripts)
|-> 0A_Prepare_taxonomy_files.sh (script to prepare the trophic mode database model)
|-> 0B_Resume_METADATA.py (script to summarize data related to transcripts)
|-> 0C_Cross-TranscriptID-ProteinID.py (script to link transcripts data and proteins ID)
|-> 1_Find_cluster.py (script to sort all Connected Components (CC) of the general sequence similarity network (SSN))
|-> 2_Retrieve_result.sh (script to create a result directory with all SSNs and their CCs)
|-> 3_Choose_Treshold.R (script to test all identity thresholds and determine functional homogeneity)
|-> 4_Igraph.R (script to process CCs graph)
|-> 5_Retrieve_Figures.sh (script to retrieve published figures from result directory)
|-> environment_REnv_Monjot_2024A.yml (conda environment)
|-> metatranscriptome_scripts (sub-directory for metatranscriptome processing scripts)
|-> parse_ko_hits.py (script to parse the KO annotation, keeping the first hit)
|-> map_taxo_to_unigene.py (script to map taxonomic affiliation of proteins to unigene)
|-> SSN_env (sub-directory with snakemake pipeline to process SSN)
|-> Snakefile
|-> modules
|-> config
Install miniconda following the standard procedure (https://docs.conda.io/projects/miniconda/en/latest/miniconda-install.html)
Then, install conda environment with the following script:
conda env create -f script/environment_REnv_Monjot_2024A.yml
conda activate REnv_Monjot_2024A

This installs the following tools:
- diamond=2.1.7=h43eeafb_1
- numpy=1.26.4=py311h64a7726_0
- pandas=2.2.2=py311h14de704_1
- parallel=20230722=ha770c72_0
- pip=24.0=pyhd8ed1ab_0
- python=3.11.9=hb806964_0_cpython
- r-base=4.3.3=hf0d99cb_1
- seqtk=1.4=he4a0461_2
- snakemake=8.13.0=hdfd78af_0
- biopython=1.79=py311hd4cff14_3
- matplotlib=3.8.4=py311h38be061_2
The following R packages must be installed (e.g. with install.packages() from an R session) to launch scripts n°3 and n°4:
* ggplot2 * ggstatsplot
* plyr * stringr
* dplyr * treemapify
* tidyr * elementalist
* cowplot * gplots
* paletteer * igraph
* reshape2 * ggraph
* varhandle * tidygraph
* ggrepel * grid
* ggpubr * FactoMineR
* ggsci * factoextra
* scales * rstatix
* hrbrthemes * ggh4x
* svglite * fmsb
* ggupset
The raw data are available in the public databases under the umbrella of BioProject PRJEB61515. There are 32 samples; each was assembled separately with Oases, and the resulting transcripts were clustered all together with CD-HIT-EST (identity >95% over >90% of the length of the smallest sequence). Transcripts longer than 50 kb were discarded, resulting in 10,359,104 representative assembled transcripts, hereafter called Unigenes. The protocol is described in greater detail in Carradec et al. (2018).
We then removed human contamination from the Unigenes. Unigenes were aligned to the human genome assembly GRCh38.p13 with minimap2 v2.14-r894-dirty using default parameters. All Unigenes with a hit against the genome were considered contaminants and thus discarded from further analysis. Here is an example of code to obtain the list of contaminants:
minimap2 -t 12 GRCh38.p13.genome.fa.gz Unigenes.fa | cut -f 1 | sort | \
uniq > list_unigene_human_contaminant

A total of 32,218 Unigenes were discarded.
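The filtering step itself can be sketched in Python (a minimal illustration over hypothetical (header, sequence) tuples; the actual pipeline works on fasta files produced by minimap2):

```python
def drop_contaminants(records, contaminant_ids):
    """Keep only Unigenes whose ID is absent from the minimap2-derived list.

    records: iterable of (header, sequence) tuples; the Unigene ID is taken
    as the first whitespace-separated token of the header.
    """
    bad = set(contaminant_ids)
    return [(h, s) for h, s in records if h.split()[0] not in bad]
```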
We used TransDecoder v5.5.0 to predict coding sequences present on the Unigenes. The minimal protein length was set to 70 amino acids, as some Unigenes yielded no protein with the default parameters:
# Extract long Open-Reading Frames
TransDecoder.LongOrfs -m 70 --output_dir out_transDecoder -t unigenes.fa
# Predict the likely coding regions
TransDecoder.Predict --output_dir out_transDecoder -t unigenes.fa
# Clean the deflines
sed -i "s/ \+$//" unigenes.fa.transdecoder.pep

Proteins were then checked against the AntiFam database with HMMER v3.3.2, using the profiles' score cutoffs. Spurious proteins were discarded:
# Resources:
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/AntiFam/current/Antifam.tar.gz
tar -zxf Antifam.tar.gz
# Run the comparison: only positive hits
hmmsearch --cut_ga --noali --tblout antifam_search.tsv AntiFam.hmm proteins.fa

We used MetaEuk at commit 57b63975a942fbea328d8ea39f620d6886958eca. The taxonomic affiliation is based on the database provided by the MetaEuk authors, available here. This web page provides a download link as well as a complete description of the origin of the data. Beware: the database is a 20 GB tar.gz archive that takes up to 200 GB of disk space once uncompressed.
MetaEukTaxoDB=MMETSP_zenodo_3247846_uniclust90_2018_08_seed_valid_taxids
# Create the 'MMSeqs' database
metaeuk createdb unigenes.fa UnigeneDB
# Search the taxonomy for each protein
metaeuk taxonomy UnigeneDB $MetaEukTaxoDB Unigene_taxoDB tmp \
--majority 0.5 --tax-lineage 1 --lca-mode 2 --max-seqs 100 -e 0.00001 \
-s 6 --max-accept 100
# Get a Krona-like plot
metaeuk taxonomyreport $MetaEukTaxoDB Unigene_taxoDB \
Unigene_report.html --report-mode 1
# Get a tsv
metaeuk createtsv UnigeneDB Unigene_taxoDB \
Unigene_taxonomy_result.tsv

Then we associated the taxonomy of each protein to its corresponding Unigene. When a single protein is present on a Unigene, we simply transferred the taxonomic annotation. Otherwise we applied this strategy:
- one or more unclassified proteins and a single affiliated protein: we transfer the affiliation as is
- at least two affiliated proteins: Lowest Common Ancestor (LCA) strategy
This step is performed with the script metatranscriptome_scripts/map_taxo_to_unigene.py:
python3 metatranscriptome_scripts/map_taxo_to_unigene.py \
-i Unigene_taxonomy_result.tsv \
-b unigenes.fa.transdecoder.bed \
-o Unigene_taxonomy_result.per_Unigene.tsv

Note that Unigenes without predicted proteins are absent from this file.
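The per-Unigene strategy above can be sketched as follows (a minimal illustration over semicolon-separated lineage strings, not the actual code of map_taxo_to_unigene.py):

```python
def lca(lineages):
    """Lowest common ancestor of semicolon-separated lineage strings."""
    common = []
    for ranks in zip(*(l.split(";") for l in lineages)):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return ";".join(common)

def unigene_taxonomy(protein_lineages, unclassified="unclassified"):
    """Collapse the lineages of all proteins on a Unigene into one affiliation."""
    informative = [l for l in protein_lineages if l and l != unclassified]
    if not informative:
        return unclassified
    if len(informative) == 1:
        return informative[0]  # single affiliated protein: transfer as is
    return lca(informative)    # >= 2 affiliated proteins: LCA strategy
```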
From the taxonomic information, Unigenes affiliated to Bacteria, Archaea or Viruses were removed, representing approximately 250,000 Unigenes. As our focus is on single-celled eukaryotes, about 150,000 Unigenes affiliated to Metazoa were also discarded.
Proteins were annotated with KEGG KO identifiers using koFamScan v1.3.0 and the KO HMM profiles release "2022-01-03", available here.
# Get data
wget https://www.genome.jp/ftp/db/kofam/archives/2022-01-03/ko_list.gz
gunzip ko_list.gz
wget https://www.genome.jp/ftp/db/kofam/archives/2022-02-01/profiles.tar.gz
tar -zxf profiles.tar.gz
# Run
exec_annotation -o results.koFamScan.tsv --format detail-tsv --ko-list ko_list \
--profile profiles proteins.fa

Then, we sent the results to the Python3 script
metatranscriptome_scripts/parse_ko_hits.py:
python3 parse_ko_hits.py --input results.koFamScan.tsv \
--output results.koFamScan.parsed.tsv

This script parses the results with the following order of preference:
- Keep a hit if koFamScan tags it as significant, i.e. with a * in the first field. This type of result is tagged "significant" in the output.
- If the current protein has no significant hit, keep the best hit if its e-value is ≤ 1e-5.
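This order of preference can be sketched as below. Representing hits as (flag, ko, evalue) tuples is an assumed simplification of the detail-tsv columns; parse_ko_hits.py is the authoritative implementation.

```python
def select_ko_hit(hits, evalue_cutoff=1e-5):
    """Pick one KO per protein following the order of preference above.

    hits: list of (flag, ko, evalue) tuples, where flag is "*" for hits
    that koFamScan marks significant. Returns (ko, tag) or (None, None).
    """
    significant = [h for h in hits if h[0] == "*"]
    if significant:
        # significant hits always win, regardless of other e-values
        return min(significant, key=lambda h: h[2])[1], "significant"
    passing = [h for h in hits if h[2] <= evalue_cutoff]
    if passing:
        return min(passing, key=lambda h: h[2])[1], "best_evalue"
    return None, None
```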
Results of the whole annotation process are available here. Download these files and place them in the rawdata directory.
We generate a table, ready to be filled, corresponding to the model of the trophic mode database:
bash script/0A_Prepare_taxonomy_files.sh

Then the resulting table Table_assoc_Temp.tsv has to be completed from the literature to create the final trophic mode database. This table corresponds to the Supplementary file Table_S1.tsv and is provided in the rawdata directory.
This table was developed as part of a metatranscriptomic study (Monjot et al., 2023), using the literature (362 scientific articles) to link, whenever possible, the proteins' taxonomic affiliations (1,052 distinct affiliations) to trophic modes.
Taxonomic ranks are specified with the following codes: d_(Division), k_(Kingdom), p_(Phylum), c_(Class), o_(Order), f_(Family) and g_(Genus). Trophic modes are identified with the following codes: PHOTO: photo-osmo-mixotrophs, MIXO: photo-osmo-phago-mixotrophs, HET: heterotrophs, SAP: saprotrophs and PARA: parasites (facultative and obligate).
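A rank-prefixed affiliation string can be unpacked as sketched below (a minimal illustration; it assumes ranks are separated by ';', so check Table_S1.tsv for the actual separator):

```python
RANK_CODES = {"d": "Division", "k": "Kingdom", "p": "Phylum", "c": "Class",
              "o": "Order", "f": "Family", "g": "Genus"}

def parse_affiliation(affiliation):
    """Turn a rank-prefixed affiliation string into a {rank: name} dict."""
    ranks = {}
    for token in affiliation.split(";"):
        code, sep, name = token.partition("_")
        if sep and code in RANK_CODES and name:
            ranks[RANK_CODES[code]] = name
    return ranks
```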
Cross all protein information:
python3 script/0B_Resume_METADATA.py > rawdata/out_RESUME.txt
python3 script/0C_Cross-TranscriptID-ProteinID.py > rawdata/out-RESUME-PROTID.txt

Only information characterizing proteins affiliated to microbial eukaryotes is retained (removing those affiliated to pluri- or multicellular organisms):
echo -e "ID_Protein"\\t"ID_transcript"\\t"ncbi_taxonomy"\\t"Taxonomy"\\t"SpeciesInDataset"\\t"TRT_1"\\t"TRT_2"\\t"TYP_1"\\t"TYP_2"\\t"ko" > Metadata_Unicellular.txt
cat rawdata/out-RESUME-PROTID.txt | grep -v "Pluri- or multi-cellular" >> Metadata_Unicellular.txt

From the initial fasta file, we remove all sequences affiliated to pluri- or multicellular organisms:
cat rawdata/out-RESUME-PROTID.txt | grep -v "Pluri- or multi-cellular" | cut -f1 > rawdata/trsc_Unicellular.txt
seqtk subseq rawdata/proteins_from_Unigenes_CEQ.fa rawdata/trsc_Unicellular.txt > proteins_from_Unigenes_CEQ_Unicellular.fasta

Using the fasta and metadata files, we launch a snakemake pipeline that produces SSNs using diamond. The pipeline uses a config file that points to the inputs and defines all thresholds used to filter the general diamond alignment.
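A diamond hit becomes an SSN edge only if it passes the configured thresholds. A minimal sketch of that filter is shown below; it assumes names like 80_80_1e-50 read as COVERAGE_IDENTITY_EVALUE, and the config/ directory of SSN_env holds the authoritative definition:

```python
def keep_edge(pident, qcov, evalue,
              min_cov=80.0, min_pident=80.0, max_evalue=1e-50):
    """Decide whether a diamond hit (percent identity, query coverage,
    e-value) is kept as an edge of the sequence similarity network."""
    return qcov >= min_cov and pident >= min_pident and evalue <= max_evalue
```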
mkdir SSN_env/tr_files_test/
mkdir SSN_env/an_files_test/
mv proteins_from_Unigenes_CEQ_Unicellular.fasta SSN_env/tr_files_test/
mv Metadata_Unicellular.txt SSN_env/an_files_test/
cd SSN_env/
snakemake -c 16
# nohup snakemake -c 16 --rerun-incomplete > out.snakemake.txt &
cd ..

Using the diamond results, we recovered the derived CCs for each identity threshold used.
nohup parallel -j 16 -k 'python3 script/1_Find_cluster.py {}' ::: 80_65_1e-50 80_70_1e-50 80_75_1e-50 80_80_1e-50 80_85_1e-50 80_90_1e-50 80_95_1e-50 80_100_1e-50 > out.find_cluster.txt &
bash script/2_Retrieve_result.sh

All files generated by the previous steps are also available in the following Zenodo repository: https://zenodo.org/doi/10.5281/zenodo.11972385
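The core operation of 1_Find_cluster.py, sorting the SSN into connected components, can be sketched as below (a minimal union-find illustration, not the script itself):

```python
from collections import defaultdict

def connected_components(edges):
    """Group the nodes of an SSN edge list [(a, b), ...] into CCs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two endpoints
    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    return sorted(components.values(), key=len, reverse=True)
```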
If you want to use the exact SSN files we used, place them (the "Lagoon_output" file) in a "result/" directory located in the current directory.
To launch these steps, please refer to the first chapter to install the R dependencies. We carry out several checks and benchmarks to better define the similarity threshold to use.
Rscript script/3_Choose_Treshold.R

In Monjot et al. (2025), we chose a threshold of 80% similarity. We therefore launch the script to process the SSN using this threshold as follows:
Rscript script/4_Igraph.R 80_80_1e-50
To retrieve the article figures, run the following script:
bash script/5_Retrieve_Figures.sh 80_80_1e-50

The resulting figures can be found in the Monjot_et_al_2025/ directory.