Decouple annotation from library data + other fixes from USC PANTHER by dustine32 · Pull Request #8 · ebi-pf-team/treegrafter

dustine32 · 2026-03-26T19:00:46Z

Hello!

I've been working with your version of TreeGrafter over the past week and have implemented a major change to how we (the USC PANTHER team) use PAINT GO annotation data versus PANTHER library data with TreeGrafter. In short, TreeGrafter can now be given an annotation file to use, instead of hard-coding the PAINT_Annotatations_TOTAL.txt file that is (or was) included in our TreeGrafter-specific PANTHER library data tarballs.

The rationale (explained in this ticket) essentially boils down to this: the library data (trees, FASTA, HMMs) change less often than the annotation data, so we would like to be able to update the annotation data in TreeGrafter more frequently than the library. This PR accommodates that and, as a bonus, fixes some other issues:

- Fixed MSA length errors that occur when a query has multiple HMMER match domains. This is the same problem as issue #3 ("Error: length of query MSF longer than expected PANTHER alignment length"). The code now expects this to happen occasionally and handles it.
- Certain bad tree files caused some sequences to be assigned graft points (e.g., AN34) that do not exist in the tree. The epa-ng step would then fail (ERR Failed to find: AN34) and the node_id column of the result output would be blank (-). This was an upstream PANTHER data issue with the tree files, so this PR contains no code change for it; the fix is in the new tarball.
- Removed the selenocysteine and pyrrolysine amino-acid normalization step from the prepare code, since it was only working around an upstream data problem. I implemented the fix at PANTHER and pushed out the new data. Because this normalization was the only operation on library data in the original prepare command, the command now expects only an annotation file as input, for the remaining prepare operation (splitting the annotation data into family-specific JSON files).
- Added a --print-go option to display annotation GO terms and protein classes in the result output.
- Removed the Bio/ folder from the repo and replaced it with a requirements.txt that specifies a minimum Biopython version (>=1.86), to be installed via pip install. I may have overstepped my authority by removing it, so I'm totally open to putting the Bio/ folder back if needed.
- The files in the .plans/ folder were generated by Claude Code to implement most of the above changes.
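
As a rough illustration of the remaining prepare operation (splitting the annotation data into family-specific JSON files), here is a minimal sketch. It is not the code in this PR: the function name `split_annotations` is hypothetical, and the file layout is an assumption (tab-separated lines whose first column looks like `<family>:<node>`, for example `PTHR10000:AN5`). The real annotation file and JSON structure may differ.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_annotations(annotation_file: Path, out_dir: Path) -> int:
    """Split one annotation table into one JSON file per family.

    ASSUMED layout (not confirmed by the PR): tab-separated lines whose
    first column is "<family>:<node>" and whose remaining columns hold
    the annotation values for that node.
    Returns the number of family files written.
    """
    by_family = defaultdict(dict)
    with open(annotation_file, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            node_id, *values = line.split("\t")
            family, _, node = node_id.partition(":")
            by_family[family][node] = values

    out_dir.mkdir(parents=True, exist_ok=True)
    for family, nodes in by_family.items():
        # One file per family, e.g. out_dir/PTHR10000.json
        with open(out_dir / f"{family}.json", "w", encoding="utf-8") as out:
            json.dump(nodes, out)
    return len(by_family)
```

Grouping everything in memory before writing keeps the sketch short; for a very large annotation file, writing each family as soon as its block of lines ends would use less memory, assuming the file is sorted by family.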

To be clear about what data you should currently use for PANTHER19.0 and the current PAINT annotation data:

PANTHER19.0 library URL: https://data.pantherdb.org/ftp/downloads/TreeGrafter/PANTHER19.0_data_trees_hmms_only.tar.gz
PAINT (Pan-GO 2.0) annotation file URL: https://data.pantherdb.org/ftp/downloads/pango/export_annotations/current/PAINT_TreeGrafter_Annotations_TOTAL.txt.gz
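
For anyone scripting the setup, the two downloads above could be fetched and unpacked roughly as follows. This is a minimal sketch using only the Python standard library; the helper names (`download`, `unpack_library`, `gunzip`, `fetch_all`) and the `treegrafter_data` directory are hypothetical and are not part of TreeGrafter. Only the two URLs come from this PR.

```python
import gzip
import shutil
import tarfile
import urllib.request
from pathlib import Path

# URLs as given in this PR description.
LIBRARY_URL = (
    "https://data.pantherdb.org/ftp/downloads/TreeGrafter/"
    "PANTHER19.0_data_trees_hmms_only.tar.gz"
)
ANNOTATION_URL = (
    "https://data.pantherdb.org/ftp/downloads/pango/export_annotations/"
    "current/PAINT_TreeGrafter_Annotations_TOTAL.txt.gz"
)


def download(url: str, dest_dir: Path) -> Path:
    """Download url into dest_dir, keeping the remote filename."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


def unpack_library(tarball: Path, dest_dir: Path) -> None:
    """Extract a .tar.gz library tarball into dest_dir."""
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(dest_dir)


def gunzip(src: Path) -> Path:
    """Decompress a .gz file next to itself and return the new path."""
    target = src.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(src, "rb") as fin, open(target, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return target


def fetch_all(data_dir: Path = Path("treegrafter_data")) -> Path:
    """Fetch both files; returns the path of the decompressed annotation file."""
    unpack_library(download(LIBRARY_URL, data_dir), data_dir / "library")
    return gunzip(download(ANNOTATION_URL, data_dir))
```

Calling `fetch_all()` would leave the library under `treegrafter_data/library` and the decompressed annotation file beside the downloaded archive. Whether TreeGrafter wants the annotation file compressed or decompressed is not stated here, so the `gunzip` step is an assumption.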

Note that PANTHER19.0_data_trees_hmms_only.tar.gz is a one-off filename. For future library data tarballs (e.g., PANTHER20.0), the filename will return to the PANTHER##.#_data.tar.gz convention, and the tarball will no longer contain any annotation file. I'll

Let me know what you think about these changes. I'm pretty flexible: I'm willing to commit to working exclusively with the ebi-pf-team/treegrafter repo from now on, to maintain my own fork for developing TreeGrafter further, or both. Also, a belated thank you for converting TreeGrafter to Python!

dustine32 added 9 commits March 19, 2026 13:29

5e43f65  Fix MSF longer than expected error
7fe8e7e  Update requirements.txt
         Unpin biopython from exact version
930303a  Merge pull request #1 from dustine32/fix-querymsf-alignment
         Fix MSF longer than expected error
c13a8cc  Output GO and PC terms in results
f284ef3  Merge pull request #2 from dustine32/add-print-go-flag
         Output GO and PC terms in results
f52b398  Separate annot file option
bdb7ed1  Separate annotdir and libdir; drop fasta AA normalization
a5be70a  Minor README clarifications
984557b  Merge pull request #3 from dustine32/annotation-file-option
         Decouple PAINT annotation data from PANTHER library data

matthiasblum self-requested a review March 27, 2026 08:44

dustine32 wants to merge 9 commits into ebi-pf-team:main from dustine32:main
