Decouple annotation from library data + other fixes from USC PANTHER #8

Open
dustine32 wants to merge 9 commits into ebi-pf-team:main from dustine32:main

@dustine32

Hello!

I've been working with your version of TreeGrafter over the past week and have implemented a major change to how we (the USC PANTHER team) use PAINT GO annotation data vs. PANTHER library data with TreeGrafter. Basically, you can now point TreeGrafter at an annotation file of your choice instead of relying on the hard-coded PAINT_Annotatations_TOTAL.txt file that is/was included in our PANTHER TreeGrafter-specific library data tarballs.
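As a rough illustration of the decoupling described above, here is a minimal sketch of resolving the annotation file from a user-supplied path. The option name `--annotations`, the helper `resolve_annotation_path`, and the fallback to a legacy file inside the library directory are all assumptions made for this example, not necessarily the interface implemented in this PR.

```python
import argparse
from pathlib import Path

# Legacy file name mentioned in the PR description. Its location directly
# inside the library directory is an assumption made for this sketch.
LEGACY_ANNOTATION_FILE = "PAINT_Annotatations_TOTAL.txt"


def resolve_annotation_path(library_dir, annotation_file=None):
    """Return the annotation file to use.

    An explicitly supplied file wins; otherwise fall back to the legacy
    file bundled with the library data (hypothetical layout).
    """
    if annotation_file is not None:
        return Path(annotation_file)
    return Path(library_dir) / LEGACY_ANNOTATION_FILE


def build_parser():
    # Hypothetical option names, not TreeGrafter's actual CLI.
    parser = argparse.ArgumentParser(description="annotation path sketch")
    parser.add_argument("library_dir", help="PANTHER library data directory")
    parser.add_argument("--annotations", default=None,
                        help="annotation file kept separate from the library")
    return parser
```

With this layout, updating the annotations only means passing a different file; the library directory is left untouched.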

The rationale (explained in this ticket) essentially boils down to this: the library data (trees, FASTA, HMMs) change less often than the annotation data, so we would like to be able to update annotation data in TreeGrafter more frequently than the library. This PR accommodates that and, as a bonus, fixes some other issues:

  1. MSA length errors when a query has multiple HMMER match domains. This is the same problem as issue #3 ("Error: length of query MSF longer than expected PANTHER alignment length"). The code is now more robust: it expects this to happen occasionally and handles it.
  2. Certain bad tree files resulted in some sequences getting bad graft points (e.g., AN34) that do not really exist in the tree. The epa-ng step would then err out (ERR Failed to find: AN34) and the result output node_id col would be blank (-). This was an upstream PANTHER data issue with the tree files, so there is no code change in this PR. The change is in the new tarball.
  3. Removed the Selenocysteine and Pyrrolysine AA normalization step in the prepare code, since it was basically working around an upstream data problem. I implemented the fix at PANTHER and pushed out the new data. Because this normalization was the only operation on library data in the original prepare command, the command now expects only an annotation file as input, for the remaining prepare operation (splitting the annotation data into family-specific JSON files).
  4. Added --print-go option to display annotation GO terms and protein classes in result output.
  5. Removed the Bio/ folder from the repo and replaced it with a requirements.txt specifying the minimum Biopython version (>=1.86), to be installed via pip install. I may have overstepped my authority by removing it, so I'm totally open to putting the Bio/ folder back if needed.
  6. The .plans/ folder files were generated by Claude Code to implement most of the above changes.
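To make the prepare operation in item 3 more concrete, here is a minimal sketch of splitting one annotation file into family-specific JSON files. The input format (tab-separated lines of the form `<family>:<node>`, then the annotation text) and the function name `split_annotations_by_family` are assumptions made for illustration; the actual file format and the prepare code in this PR may differ.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_annotations_by_family(annotation_path, out_dir):
    """Write one JSON file per family and return the sorted family IDs.

    Assumed input (hypothetical): tab-separated lines of the form
    "<family>:<node>\t<annotation text>".
    """
    by_family = defaultdict(dict)
    with open(annotation_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, _, annotation = line.partition("\t")
            family, _, node = key.partition(":")
            by_family[family][node] = annotation

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for family, nodes in by_family.items():
        # One file per family, e.g. PTHR10000.json, mapping node -> annotation.
        with open(out_dir / f"{family}.json", "w", encoding="utf-8") as out:
            json.dump(nodes, out, indent=2, sort_keys=True)
    return sorted(by_family)
```

Keeping one small JSON file per family means a run only has to load annotations for the families its queries actually match.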

To be clear about what data you should currently use for PANTHER19.0 and the current PAINT annotation data:

Note that PANTHER19.0_data_trees_hmms_only.tar.gz is a one-off filename. For future library data tarballs (e.g., PANTHER20.0), the filename will return to the PANTHER##.#_data.tar.gz convention, and the tarballs will no longer contain any annotation file. I'll

Let me know what you think about these changes. I'm pretty flexible: I'm willing to commit to working exclusively with the ebi-pf-team/treegrafter repo from now on, to maintain my own fork for developing TreeGrafter further, or both. Also, a belated thank you for converting TreeGrafter to Python!

@matthiasblum matthiasblum self-requested a review March 27, 2026 08:44