Perseus: refining Kraken2 taxonomic classifications of long reads and contigs

Perseus is a post-processing framework for refining Kraken2 taxonomic classifications, with a focus on long-read metagenomics (PacBio HiFi, ONT). Kraken2’s exact k-mer matching makes classification fast and sensitive, but it can produce overconfident fine-rank calls when the supporting k-mer evidence is sparse, falls in conserved regions, or comes from organisms only partially represented in the reference database. These failure modes are common in long-read and high-novelty metagenomes. Perseus reduces the resulting false-positive fine-rank calls by using the structured k-mer evidence already present in the Kraken2 output to distinguish trustworthy predictions from spurious ones.

Perseus assigns confidence probabilities to each Kraken2 classification at every canonical taxonomic rank, enabling informed decisions to confirm assignments, back off to higher, lineage-consistent ranks, or convert predictions to unclassified.
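
To illustrate the confirm / back off / unclassify decision described above, here is a minimal Python sketch. It is not Perseus's actual decision rule: the rank list, its ordering, the function name choose_rank, and the 0.5 threshold are all assumptions made for the example.

```python
# Canonical ranks from finest to coarsest.
# ASSUMPTION: this list and ordering are illustrative, not taken from Perseus.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]


def choose_rank(probs, threshold=0.5):
    """Pick the finest rank whose probability meets the threshold.

    probs maps rank name -> probability. The threshold is hypothetical.
    Returns (rank, probability), or ("unclassified", None) if no rank passes.
    """
    for rank in RANKS:
        p = probs.get(rank)
        if p is not None and p >= threshold:
            # Confirmed at this rank; coarser ranks are not examined.
            return rank, p
    # No rank is supported well enough: treat the read as unclassified.
    return "unclassified", None


# A weak species call with a strong genus call backs off to genus.
print(choose_rank({"species": 0.2, "genus": 0.9, "family": 0.95}))
```

Walking the ranks from finest to coarsest keeps the most specific assignment that is still well supported, which is the behaviour the paragraph above describes.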

Perseus is built on a multi-headed 1D convolutional neural network that operates directly on features derived from Kraken2 output. The workflow constructs a lineage-aware feature matrix from a standard Kraken2 output file, then performs inference to produce a Kraken2-compatible output augmented with per-rank confidence probabilities for each assignment. Perseus operates strictly as a downstream confidence filter and does not perform reclassification, alignment, or novel taxon discovery.

Installation

Conda installation (recommended)

Perseus is available through conda. We recommend creating a new environment:

conda create -n perseus -c matnguyen -c conda-forge -c pytorch perseus
conda activate perseus

Getting started

Feature extraction

The extract command reads a Kraken2 output file and writes the derived features to a directory of sharded Parquet files.

perseus extract <kraken_file> <output_shards_directory>
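
For orientation, the raw evidence that feature extraction starts from is the per-read k-mer mapping string in the last column of standard Kraken2 output. The sketch below only parses one such line into per-taxon k-mer counts. It does not reproduce the lineage-aware features Perseus builds, and the function name parse_kraken_line is hypothetical.

```python
from collections import Counter


def parse_kraken_line(line):
    """Parse one line of standard (tab-separated, five-column) Kraken2 output.

    Illustrative only: this is not how Perseus builds its feature matrix.
    """
    status, read_id, taxid, length, kmers = line.rstrip("\n").split("\t")
    counts = Counter()
    for token in kmers.split():
        if token == "|:|":
            # Separator between mates in paired-end output; carries no counts.
            continue
        key, n = token.rsplit(":", 1)
        # key is a taxid, "0" for unclassified k-mers, or "A" for ambiguous ones.
        counts[key] += int(n)
    return {
        "classified": status == "C",
        "read_id": read_id,
        "taxid": taxid,
        "length": length,  # kept as text, since paired reads report "len1|len2"
        "kmer_hits": dict(counts),
    }


record = parse_kraken_line("C\tread1\t562\t150\t562:13 561:4 A:31 0:1 562:3")
print(record["kmer_hits"])
```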

Filtering

The filter command takes the shard directory produced by extract, together with the original Kraken2 output file, and writes the filtered classifications to the given output path.

perseus filter <shards_directory> <kraken_file> <output_path>

The output file follows the Kraken2 output format, minus the string of k-mer matches, and with the following additional columns:

perseus_taxid - the taxonomic ID assigned by Perseus
prob_{rank} - the assignment probability at a canonical {rank}
chosen_rank - the final chosen rank assigned by Perseus
chosen_prob_at_rank - the probability at the final chosen rank
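
As a sketch of downstream use, the snippet below keeps only rows whose chosen_prob_at_rank meets a cutoff. It assumes the output is tab-separated and begins with a header row naming the columns; neither is confirmed here, so check your own file first. The function name confident_rows and the 0.9 cutoff are hypothetical.

```python
import csv
import io


def confident_rows(handle, min_prob=0.9):
    """Yield rows whose chosen_prob_at_rank is at least min_prob.

    ASSUMPTION: tab-separated input with a header row; min_prob is illustrative.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            prob = float(row["chosen_prob_at_rank"])
        except (TypeError, ValueError):
            # Missing or non-numeric probability: skip the row.
            continue
        if prob >= min_prob:
            yield row


# Small in-memory stand-in for a Perseus output file (made-up values).
demo = io.StringIO(
    "read_id\tperseus_taxid\tchosen_rank\tchosen_prob_at_rank\n"
    "r1\t562\tspecies\t0.97\n"
    "r2\t561\tgenus\t0.42\n"
)
for row in confident_rows(demo):
    print(row["read_id"], row["chosen_rank"])
```

With a real file, pass an open file handle instead of the in-memory demo.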

Testing Data

Test data is provided under tests/test_data. The Kraken2 output file is tests/test_data/test_kraken.txt, the shards are in tests/test_data/test_shards, and the expected Perseus output file is tests/test_data/filtered.txt.

Testing the Installation

Quick Example

Run Perseus on the included test data:

perseus extract tests/test_data/test_kraken.txt example_extract
perseus filter example_extract tests/test_data/test_kraken.txt example_filtered.txt

This should produce an output file example_filtered.txt.

Because Perseus runs on PyTorch, floating-point results can differ slightly across platforms, so a plain diff against the reference file may report mismatches even when the output is correct.

To compare the output with the expected results using a numerical tolerance:

python scripts/compare_outputs.py example_filtered.txt tests/test_data/filtered.txt
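
The idea behind a tolerance-based comparison can be sketched as follows. This is not the code in scripts/compare_outputs.py: the function names, the tab-separated layout, and the tolerance values are assumptions chosen for illustration.

```python
import math


def _is_int(text):
    """Return True if text parses as an integer (e.g. a taxid or a length)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def fields_match(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """Compare two fields: exact for text and integers, tolerant for floats."""
    if a == b:
        return True
    if _is_int(a) and _is_int(b):
        # Integer fields such as taxids must agree exactly.
        return False
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        # At least one side is non-numeric text and the strings differ.
        return False


def lines_match(line_a, line_b, **tol):
    """Compare two tab-separated lines field by field."""
    fa = line_a.rstrip("\n").split("\t")
    fb = line_b.rstrip("\n").split("\t")
    return len(fa) == len(fb) and all(
        fields_match(x, y, **tol) for x, y in zip(fa, fb)
    )


# Probabilities that differ only by floating-point noise are accepted.
print(lines_match("r1\t562\t0.9000001", "r1\t562\t0.9000002"))
```

Comparing integer fields exactly prevents a relative tolerance from accepting two different large taxids as equal.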

Running the Full Test Suite (optional)

For a full reproducibility check, run the included test suite.

Install the testing dependency:

pip install pytest

Then run:

pytest -q

This runs unit tests and end-to-end pipeline tests used during development.

Citing Perseus

Our preprint can be found here: https://www.biorxiv.org/content/10.64898/2026.03.06.710148v1

Data Generation Scripts

Scripts for generating the inclusion/exclusion simulated data are found here: https://github.com/matnguyen/perseus-scripts
