mmarbout/Tuto_MetaTOR

Tuto_MetaTOR

For this session we will learn how to use 3C/Hi-C data and MetaTOR to perform the binning of a simple metagenome. The data come from a mock community designed at the lab.

MetaTOR

Metagenomic Tridimensional Organisation-based Reassembly

If you want more detailed documentation of MetaTOR and of the different possibilities offered by the pipeline, several resources are available at the following links:

  • The package is available here.
  • An advanced tutorial explaining how to use MetaTOR is available here.
  • Manual curation of contaminated bins with Anvi'o: available here.
  • Visualization and scaffolding of the MAGs with the contactmap module of MetaTOR: available here.

Principle of the MetaTOR pipeline:

[Figure: metator_pipeline]

Dataset

In this analysis, we will use a simple metagenomic dataset with a defined community. It will allow us to run some tests without too much computational time.

The data for the tutorial are available on the public storage partition of your VM and need to be copied to your working directory:

cp  /ifb/data/public/teachdata/ebame/MetaTOR-2025/Tuto_MetaTOR.tar.gz ./

Then decompress the archive:

tar -xvf Tuto_MetaTOR.tar.gz

The folder contains three subfolders:

  • FastQ: the FastQ files corresponding to the Hi-C library of the mock community.
  • assembly: the FastA files of the assembly.
  • metator_final: different files of the final output.

ls -l Tuto_MetaTOR/
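Before going further, it can help to confirm that the archive was extracted as described. The sketch below is only an illustration: `check_layout` is a hypothetical helper (not part of MetaTOR) that looks for the three subfolder names listed above and reports which ones are present.

```shell
# check_layout: hypothetical helper, not part of MetaTOR.
# Reports whether the three subfolders named in this tutorial exist
# under the given root folder; returns 1 if any of them is missing.
check_layout() {
    local root="$1" missing=0 sub
    for sub in FastQ assembly metator_final; do
        if [ -d "$root/$sub" ]; then
            echo "ok       $sub"
        else
            echo "missing  $sub"
            missing=1
        fi
    done
    return "$missing"
}
```

After extraction, `check_layout Tuto_MetaTOR` should print three `ok` lines if the folder matches the layout described above.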

The assembly can be found here: [Tuto_MetaTOR/assembly]

ls -l Tuto_MetaTOR/assembly/

Here the assembly was built from shotgun sequences (PE Illumina sequencing: 2x75 bp, NextSeq500). Before assembly, the reads were filtered and trimmed using Cutadapt (v1.9.1). The assembly was then obtained using Megahit (v1.1.1.2) with default parameters.

In order to perform the binning based on 3D contacts, we also need a 3C dataset from the same sample.

The FastQ Hi-C PE reads can be found here (a small part of the whole dataset): [Tuto_MetaTOR/FastQ/]

ls -l Tuto_MetaTOR/FastQ/

Config

First of all, we have to activate the conda environment:

conda activate metator

Usage

MetaTOR is a modular pipeline: each step can be run separately or as part of an end-to-end pipeline.

metator --help

A metaTOR command takes the form metator action --param1 arg1 --param2 arg2 #etc.

There are three actions/steps in the metaTOR pipeline, which must be run in the following order:

  • network : Generate metaHiC contigs network from fastq reads or bam files and normalize it.

  • partition : Perform the Louvain or Leiden community detection algorithm many times to bin contigs according to the metaHiC signal between contigs.

  • validation : Use CheckM to validate the bins, then do a recursive decontamination step to remove contamination.

There is also an option to run the whole pipeline (end-to-end):

  • pipeline : Run all three of the above actions sequentially or only some of them depending on the arguments given. This can take a while.

There are a number of other, optional, miscellaneous actions:

  • qc : Runs some quality checks on the output of metaTOR.

  • contactmap : Generates a contact map for one bin from the final output of metaTOR.

  • scaffold : Tries to scaffold a well-covered bin from the final output of metaTOR.

  • pairs : Sort the pairs file using pairtools. Compress them using bgzip. Index them using pairix.

  • version : display current version number.

  • help : display help message.

End-to-End pipeline

Using the provided dataset, you can launch the whole pipeline.

metator pipeline --help
metator pipeline -F -i 10 -j 3 -a Tuto_MetaTOR/assembly/assembly_mock.fa -1 Tuto_MetaTOR/FastQ/lib1_3C_R1.fq.gz -2 Tuto_MetaTOR/FastQ/lib1_3C_R2.fq.gz -o Tuto_MetaTOR/out_MetaTOR/

NB: The option [-F] is mandatory if the output directory already exists.

This command can take some time.
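If you would rather keep earlier results than overwrite them with [-F], one option is to write each run to a directory that does not exist yet. The sketch below assumes nothing about MetaTOR itself: `next_free_dir` is a hypothetical helper that returns the first unused name among BASE, BASE_2, BASE_3 and so on.

```shell
# next_free_dir: hypothetical helper, not part of MetaTOR.
# Prints the first path among BASE, BASE_2, BASE_3, ... that does not exist.
next_free_dir() {
    local base="${1%/}" candidate n=2   # strip one trailing slash, if any
    candidate="$base"
    while [ -e "$candidate" ]; do
        candidate="${base}_${n}"
        n=$((n + 1))
    done
    echo "$candidate"
}
```

For example, `out=$(next_free_dir Tuto_MetaTOR/out_MetaTOR)` gives a path that can be passed to `-o` without touching a previous run.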

MetaTOR will provide you with various metrics about the whole pipeline. It will also generate different files necessary for downstream analysis.

ls -l Tuto_MetaTOR/out_MetaTOR/

You will find information about the contigs and their binning here:

cat Tuto_MetaTOR/out_MetaTOR/contig_data_final.txt | head

And about the MAGs here:

cat Tuto_MetaTOR/out_MetaTOR/bin_summary.txt | head
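The output tables are easier to read once you know which column is which. The sketch below assumes that these files are tab-separated with a header on the first line (an assumption to check against your own output); `list_columns` is a hypothetical helper that prints each column name with its index.

```shell
# list_columns: hypothetical helper, not part of MetaTOR.
# Assumes a tab-separated file whose first line is a header.
# Prints one "index<TAB>column name" line per column.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}
```

For example, `list_columns Tuto_MetaTOR/out_MetaTOR/bin_summary.txt` shows the column indices you can then pass to `cut -f` or `sort -k`.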

NB: the file [binning.txt] can be used in Anvi'o to clean the MAGs or to visualize them.

MetaTOR can be restarted at different points of the pipeline. Using BAM files or PAIRS files as the starting point makes a rerun faster. You can also restart the pipeline with a different number of iterations of the Louvain algorithm. Here we will restart the pipeline at the PAIRS level.

metator pipeline -F -i 5 -j 3 --start pair -1 Tuto_MetaTOR/out_MetaTOR/alignment_0_sorted.pairs.gz -a Tuto_MetaTOR/assembly/assembly_mock.fa -o Tuto_MetaTOR/out_MetaTOR_2/

We can also run different numbers of iterations of the Louvain algorithm in order to see how the output varies.

# Run the pipeline from the PAIRS file with 1, 3, 5, 7 and 9 iterations
for it in $(seq 1 2 9)
do
    echo "number of iterations: $it"
    # each run writes to its own output directory
    metator pipeline -F -i "$it" -j 3 --start pair -1 Tuto_MetaTOR/out_MetaTOR/alignment_0_sorted.pairs.gz -a Tuto_MetaTOR/assembly/assembly_mock.fa -o Tuto_MetaTOR/out_MetaTOR_it"$it"/
    echo "FINITO"
    echo ""
done
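Once the loop has finished, a quick way to compare the runs is to count how many bins each one reports. The sketch below assumes that bin_summary.txt has one header line followed by one line per bin (an assumption to check against your own files); `count_bins` is a hypothetical helper, and runs whose output is missing are reported as NA.

```shell
# count_bins: hypothetical helper, not part of MetaTOR.
# Assumes one header line followed by one non-empty line per bin.
count_bins() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# One line per run: number of iterations, then number of bins (or NA).
for it in $(seq 1 2 9); do
    f="Tuto_MetaTOR/out_MetaTOR_it${it}/bin_summary.txt"
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$it" "$(count_bins "$f")"
    else
        printf '%s\t%s\n' "$it" "NA"
    fi
done
```

The bin count is only a first indicator; the quality metrics in each bin_summary.txt remain the better basis for comparing runs.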

MetaTOR uses the software miComplete to validate MAGs and to select the MAGs that need to be cleaned through a recursive run of the algorithm. Indeed, in very large networks (which is not the case here), the algorithm suffers from resolution limits and sometimes needs to be re-run on sub-networks. miComplete is a bit less precise than CheckM but is much faster and uses less memory. Generally, at the end of the pipeline, we use CheckM or GTDB-Tk to properly assess the quality of the retrieved MAGs and to annotate them.

3D Analysis

3C data and MetaTOR (through its connection with our software hicstuff) also make it possible to generate contact matrices of various genomic objects (contigs, bins, MAGs, overlapping MAGs). However, I just realized that the latest version of one package is no longer compatible with our pipeline.

The command follows these rules:

metator contactmap --help

Now we can generate one contact map file:

mkdir -p Tuto_MetaTOR/contact_map/
metator contactmap -a Tuto_MetaTOR/assembly/assembly_mock.fa -c Tuto_MetaTOR/metator_final/contig_data_final.txt -n "NODE_1078_len_298687" -o Tuto_MetaTOR/contact_map/ -O contig -F -f -e HinfI,DpnII Tuto_MetaTOR/metator_final/alignment_0_sorted.pairs.gz

By re-using the command, generate a contact map of the most covered or longest contig, of the most covered or largest MAG, etc. (all the data you need are present in the directory with the different output files [Tuto_MetaTOR/MetaTOR]). Be careful to change the name of the output directory!

WARNING: the command only generates the contact map files, not the PDF files. To generate an image file, we will use hicstuff and several command lines.

hicstuff has many commands and options:

hicstuff --help 

One command rebuilds a contact map (i.e. a matrix) with a fixed bin size in kilobases (kb):

hicstuff rebin --help 

Here is an example of a command line to rebin a contact map to 10 kb:

hicstuff rebin -b 10kb -f Tuto_MetaTOR/contact_map/NODE_1078_len_298687.frags.tsv -c Tuto_MetaTOR/contact_map/NODE_1078_len_298687.chr.tsv Tuto_MetaTOR/contact_map/NODE_1078_len_298687.mat.tsv Tuto_MetaTOR/contact_map/NODE_1078_len_298687_10kb
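To get a feel for what a given bin size means, you can estimate the number of bins along a contig. The sketch below is simple arithmetic, not a description of hicstuff internals: it assumes that the length encoded in the contig name (298687 bp for NODE_1078_len_298687) is the real contig length and that bins have a fixed width, so the actual matrix may differ slightly. `n_bins` is a hypothetical helper.

```shell
# n_bins: hypothetical helper, not part of hicstuff.
# Usage: n_bins LENGTH_BP BIN_SIZE_BP
# Prints ceil(LENGTH / BIN_SIZE) using integer arithmetic.
n_bins() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

n_bins 298687 10000   # prints 30
```

Under these assumptions, the rebinned matrix of this contig at 10 kb is about 30 x 30.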

Another hicstuff command rebins a matrix and generates an image file of the contact map in a single step:

hicstuff view --help 

Here is an example of a command line to rebin a contact map to 10 kb and generate the corresponding PDF file:

hicstuff view -b 10kb -o Tuto_MetaTOR/contact_map/NODE_1078_len_298687_10kb_raw.pdf -f Tuto_MetaTOR/contact_map/NODE_1078_len_298687.frags.tsv Tuto_MetaTOR/contact_map/NODE_1078_len_298687.mat.tsv

In this case, the contact map is generated from the raw interaction scores. In general, the signal needs to be normalized.

Here is the same command line with the normalization step:

hicstuff view -b 10kb -n -o Tuto_MetaTOR/contact_map/NODE_1078_len_298687_10kb_norm.pdf -f Tuto_MetaTOR/contact_map/NODE_1078_len_298687.frags.tsv Tuto_MetaTOR/contact_map/NODE_1078_len_298687.mat.tsv

You can now generate the image files of your different matrices (the largest contig, a MAG, etc.). Be careful with the bin size when generating matrices for MAGs: computation can be time-consuming for a large MAG at high resolution (a few kb).
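The reason resolution matters so much is that the number of matrix cells grows with the square of the number of bins. The sketch below illustrates this with a made-up 3 Mb MAG (the length is a hypothetical value, not taken from the dataset), and `matrix_cells` is a hypothetical helper. It counts cells of a dense matrix, which is an upper bound since contact maps are usually stored in sparse form.

```shell
# matrix_cells: hypothetical helper, not part of hicstuff.
# Usage: matrix_cells LENGTH_BP BIN_SIZE_BP
# Prints the number of cells of a dense square matrix at that bin size.
matrix_cells() {
    local bins=$(( ($1 + $2 - 1) / $2 ))   # ceil(length / bin size)
    echo $(( bins * bins ))
}

# Hypothetical 3 Mb MAG at three resolutions (1 kb, 5 kb, 10 kb).
for bin_size in 1000 5000 10000; do
    printf '%s\t%s\n' "$bin_size" "$(matrix_cells 3000000 "$bin_size")"
done
```

For this hypothetical MAG, going from 10 kb to 1 kb bins raises the cell count from 90,000 to 9,000,000, a 100-fold increase for a 10-fold gain in resolution.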

If you are really interested in Hi-C data and in contact map visualization and processing, several tools are now available to handle this type of data:

  • HiContacts: HiContacts provides tools to investigate (m)cool matrices imported in R by HiCExperiment. It leverages the HiCExperiment class of objects, built on pre-existing Bioconductor objects, namely InteractionSet, GInteractions and ContactMatrix (Lun, Perry & Ing-Simmons, F1000Research 2016), and provides analytical and visualization tools to investigate contact maps.

HiContacts

  • cooler: Cooler is a support library for a sparse, compressed, binary persistent storage format, also called cooler, used to store genomic interaction data, such as Hi-C contact matrices. The cooler file format is an implementation of a genomic matrix data model using HDF5 as the container format. The cooler package includes a suite of command line tools and a Python API to facilitate creating, querying and manipulating cooler files.

cooler

MGE binning and Host-Association

MGE Binning

MetaTOR includes two new modules, metator mge and metator host (both still under construction for better integration), which bin contigs annotated as MGEs and associate them with their hosts.

The new version should be released soon.

If you want to have a look at its development and at what we have done with it:

Phages with a broad host range are common across ecosystems

To perform MGE binning, you first need a file listing the contigs annotated as MGEs. You will find this file here:

cat Tuto_MetaTOR/metator_final/contig_mge.txt | head

These contigs have been categorized using geNomad. We can launch the mge module:

metator mge -h
metator mge -c Tuto_MetaTOR/metator_final/contig_data_final.txt -a Tuto_MetaTOR/assembly/assembly_mock.fa -b Tuto_MetaTOR/metator_final/binning.txt -m Tuto_MetaTOR/metator_final/contig_mge.txt -o Tuto_MetaTOR/MGE_out Tuto_MetaTOR/metator_final/alignment_0_sorted.pairs.gz

MGE Host association (module under construction)

This module is still under development so that it integrates properly into the whole pipeline. In the directory [metator_final] you will find the main output files of the whole pipeline:

ls Tuto_MetaTOR/metator_final/mge/

References

Contact

Authors

Research lab

Spatial Regulation of Genomes (Institut Pasteur, Paris)
