Hi Metabuli team,
I'm evaluating Metabuli on real metagenomic samples and noticed significantly
lower scores compared to synthetic benchmarks. I'm trying to understand if this
is expected and how to optimize for real-world data.
Score comparison across datasets:
| Dataset | MMseqs2 confidence | Metabuli score | Sample type |
| --- | --- | --- | --- |
| CAMI-Marine | 0.93 | 0.81 | Synthetic |
| IBS | 0.89 | 0.45 | Real (gut) |
| Freshwater | 0.60 | 0.05 | Real (environmental) |
My concerns:
- The dramatic score drop on real data makes it difficult to set confidence
thresholds for filtering
- Many assignments that look taxonomically reasonable have very low scores
- It's unclear whether low scores indicate:
  - True uncertainty (novel/divergent organisms)
  - Database coverage issues
  - Algorithm behavior on fragmented/noisy real data
Questions:
- Is this score pattern expected for real metagenomic data?
- What factors most influence Metabuli's scoring on real vs. synthetic data?
- Are there parameters I should adjust for real samples? (e.g., --min-score,
--min-sp-score)
- How do you recommend filtering classifications from real data - by score
threshold or other metrics?
- Would using a more recent GTDB version significantly improve scores on
real samples?
My setup:
- Database: GTDB r214
- Metabuli version: 1.1.0
I've attached example outputs showing the score distribution. Any guidance
on interpreting and optimizing for real data would be greatly appreciated!
Thanks!
Metabuli on freshwater-ERR4195020:

MMseqs2 on freshwater-ERR4195020:
