Further details about the construction of new phanta databases #54

efratmuller · 2025-03-21T13:32:51Z


Dear @yipinto and team!

I was wondering if you can perhaps share some more details about how you constructed the "UHGGv2 + UHGV "MQ+"" database you have kindly provided?

Specifically, I had 3 questions:

(1) The viral portion based on UHGV seems to also include viruses that did not meet the "MQ" (medium-quality) criteria as defined on the UHGV github. For example, vOTU-085841 is included in the phanta database but had an "uncertain" "viral-confidence" (as reported in UHGV's metadata) and therefore does not meet their "MQ" criteria. As another example, vOTU-018648 is only 49% complete (therefore not "MQ") but I see it in the database.

(2) In the UHGGv2 portion, it seems that not all genomes were included (>280K genomes are listed in MGnify-UHGGv2). Could you please explain which genomes were included and how they were defined as strains/species in the db?

(3) Where were the non-bacterial/archaeal/viral genomes sourced from?

Many many thanks in advance!
Efrat

meenachakra · 2025-03-21T22:04:51Z

Maintainer

Hi Efrat,
Thanks for your questions!

For point 1:

We conducted this work based on UHGV genomes/metadata from April 2023. In our database, the only genome in the vOTU-085841 cluster is UHGV-0186730 (please see the database's taxonomy/nodes.dmp and taxonomy/names.dmp). This genome is annotated as Confident viral in the UHGV metadata from April 2023.
Similarly, the five genomes included under vOTU-018648 (UHGV-2218338, UHGV-0210983, UHGV-1933255, UHGV-1758727, UHGV-1466773) have > 50% completeness in the metadata from April 2023 and are all annotated as Medium-quality, Confident viral.
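
For anyone who wants to repeat this check, here is a minimal sketch of listing the genomes filed under a given vOTU in the database taxonomy. It assumes that taxonomy/names.dmp and taxonomy/nodes.dmp follow the standard NCBI dump layout (fields separated by a tab, a pipe, and a tab, with the tax ID first, then the name in names.dmp or the parent tax ID in nodes.dmp), and that the vOTU appears under exactly that name. The helper names `read_dmp` and `members_of` are hypothetical and not part of Phanta.

```python
from collections import defaultdict
from pathlib import Path


def read_dmp(path):
    """Yield the list of fields for each row of an NCBI-style .dmp file."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Rows normally end with a trailing "\t|" terminator.
            if line.endswith("\t|"):
                line = line[:-2]
            yield [field.strip() for field in line.split("\t|\t")]


def members_of(node_name, taxonomy_dir):
    """Return the names of the direct children of the node called node_name.

    Assumption: node_name matches the name stored in names.dmp exactly.
    """
    taxonomy_dir = Path(taxonomy_dir)

    # tax ID -> first name listed for that ID
    names = {}
    for fields in read_dmp(taxonomy_dir / "names.dmp"):
        names.setdefault(fields[0], fields[1])

    # parent tax ID -> list of child tax IDs
    children = defaultdict(list)
    for fields in read_dmp(taxonomy_dir / "nodes.dmp"):
        tax_id, parent_id = fields[0], fields[1]
        if tax_id != parent_id:  # the root node is its own parent
            children[parent_id].append(tax_id)

    target_ids = [tax_id for tax_id, name in names.items() if name == node_name]
    return sorted(
        names.get(child, child)
        for tax_id in target_ids
        for child in children.get(tax_id, [])
    )
```

Calling `members_of("vOTU-085841", "path/to/database/taxonomy")` (the path is a placeholder) would then list the genome names stored beneath that vOTU, which can be compared against the UHGV metadata release of interest.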

For point 2:

We essentially dereplicated UHGGv2 at 97.5% ANI using dRep and included the representative of each 97.5% ANI cluster in our database. More specifically, to reduce computational demands, we dereplicated each precalculated 95% ANI cluster of UHGGv2 separately at 97.5% ANI and then included the representatives of the resulting 97.5% ANI clusters in our database. We chose 97.5% to match the ANI cutoff used by HumGut (which formed the prokaryotic portion of our original database).
We annotated the representatives using GTDB-Tk. Each genome was assigned as a strain of its GTDB-Tk-assigned species. In the case of missing species, we assigned the species as the name of the original 95% dereplicated cluster. Please feel free to ask additional clarifying questions!
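
The species-assignment rule described above can be sketched as follows. This is only an illustration, not the actual build script: it assumes the GTDB-Tk classification is available as the usual semicolon-separated lineage string, in which a missing species appears as an empty `s__` entry. The function name `assign_species` and the cluster label in the usage example are hypothetical.

```python
def assign_species(gtdbtk_classification, cluster_name):
    """Pick the species under which a 97.5% ANI representative is filed as a strain.

    Uses the GTDB-Tk species when one was assigned; otherwise falls back to
    the name of the original 95% ANI cluster the genome came from.
    """
    for rank in gtdbtk_classification.split(";"):
        rank = rank.strip()
        if rank.startswith("s__"):
            species = rank[len("s__"):].strip()
            if species:
                return species
    # No species-level assignment: fall back to the 95% cluster name.
    return cluster_name


# Hypothetical usage: the lineage strings and cluster label are made up.
with_species = "d__Bacteria;p__Pseudomonadota;g__Escherichia;s__Escherichia coli"
without_species = "d__Bacteria;p__Bacillota;g__;s__"
print(assign_species(with_species, "cluster_0001"))
print(assign_species(without_species, "cluster_0001"))
```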

For point 3:

UHGGv2 includes both bacterial and archaeal genomes. Viruses were from UHGV. For gut eukaryotes (fungi), we used RefSeq, and for contaminants, we used the human genome (hg38) and the Core UniVec database from NCBI.

4 replies

efratmuller Mar 22, 2025
Author

@meenachakra Thank you for the quick reply and clear answers!

efratmuller Mar 28, 2025
Author

Hi @meenachakra, I had a small follow-up question: Do you think that including only species representatives (for either kingdom) rather than multiple "strains" per species as you discuss above, will dramatically decrease phanta's performance (=accuracy in resulting profiles)? Did you happen to explore more "compact" databases and compare to the detailed ones? Thank you in advance!!

meenachakra Mar 29, 2025
Maintainer

Hi Efrat, we did do this for the viral portion of our original/default database and found that including all MGV strains (rather than only species representatives) increased Phanta's sensitivity to viruses. For prokaryotes, we stuck with HumGut's 97.5% threshold, since the HumGut authors had already done the work of comparing it to a more compact database. Please see here: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01114-w

Please let us know if you need further clarification!

efratmuller Apr 1, 2025
Author

Many thanks @meenachakra , I appreciate your quick responses!
Best,
Efrat
