Skip to content

Conversation

@petrelharp
Copy link
Contributor

@petrelharp petrelharp commented Nov 8, 2025

Edit: more detail here. Context:

  • genetic_relatedness_matrix is built on top of divergence_matrix
  • divergence_matrix was undocumented
  • divergence_matrix, and hence genetic_relatedness_matrix, uses the Scheiber-Vishkin algorithm, which has to do with TMRCAs; so, it is not suitable to site mode, and instead computes mutation mode
  • however, when implemented, there was not a mutation mode, and site mode is very similar, so previously genetic_relatedness_matrix had site and branch mode, and a caveat in the docs saying "this isn't really quite site mode in the same sense as elsewhere in the docs"

So, here I'm

  • making genetic_relatedness_matrix and divergence_matrix take "mutation" mode, and not site mode
  • documenting divergence_matrix
  • making other statistics mode not take "mutation" mode
  • adding a warning to genetic_relatedness_matrix with site mode saying "you should use mutation mode" (it still throws an error)
  • I've not added a warning or anything to divergence_matrix because it was previously undocumented.

This has very few changes, because it's just a re-naming of the previous behavior. In other words, the mechanism has not changed, just what it's called. I couldn't help but do a little bit of refactoring of some tests along the way, apologies.

This is a breaking change, since now ts.genetic_relatedness_matrix(..., mode="site") will now throw an error. Note however that ts.genetic_relatedness_matrix(...) does not change, because previously the default was mode="site" and now it is mode="mutation", which is the same thing. So, we could allow mode="site" to keep working, but deprecated, but this isn't right, because it is not computing the same thing as genetic_relatedness(..., mode="site"); it will be computing the same thing as genetic_relatedness(..., mode="mutation") once the latter is implemented.

@codecov
Copy link

codecov bot commented Nov 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.80%. Comparing base (dacbf38) to head (aaacc24).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3314   +/-   ##
=======================================
  Coverage   89.80%   89.80%           
=======================================
  Files          29       29           
  Lines       31043    31052    +9     
  Branches     5682     5685    +3     
=======================================
+ Hits        27877    27886    +9     
  Misses       1779     1779           
  Partials     1387     1387           
Flag Coverage Δ
c-tests 86.85% <100.00%> (+<0.01%) ⬆️
lwt-tests 80.38% <ø> (ø)
python-c-tests 87.06% <100.00%> (+<0.01%) ⬆️
python-tests 98.84% <100.00%> (+<0.01%) ⬆️
python-tests-no-jit 33.59% <25.00%> (-0.01%) ⬇️
python-tests-numpy1 50.17% <25.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
c/tskit/trees.c 91.00% <100.00%> (+<0.01%) ⬆️
python/_tskitmodule.c 87.06% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 98.88% <100.00%> (+<0.01%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

divergence_matrix to this mode; document divergence_matrix
@petrelharp
Copy link
Contributor Author

Note that the docs are a bit light on what the mode means generally, but that's intentional, since currently it's only used in these two functions.

@gregorgorjanc
Copy link
Member

I vote for the breaking change and throwing an error because the proposed change is doing the right thing.

@hyanwong
Copy link
Member

hyanwong commented Nov 8, 2025

I agree with @gregorgorjanc

@petrelharp
Copy link
Contributor Author

I've added an informative warning saying to change "site" to "mutation" now, which removes any hesitation I had about the breaking change. (It still errors, though.)

@petrelharp petrelharp requested a review from benjeffery November 9, 2025 19:52
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea, but I've gotten lost in what's actually being done here. Why are we updating all the code for divergence matrix, if it's not mentioned in the changelog? The changes made here feel a lot bigger than what's being discussed, but I've forgotten where we actually landed with divmat in the end.

@benjeffery benjeffery added this to the Python 1.0 milestone Nov 10, 2025
@petrelharp
Copy link
Contributor Author

The connection is that genetic_relatedness_matrix uses divergence _matrix under the hood, so this is actually affecting both. My mistake missing it from the changelog.

Co-authored-by: Jerome Kelleher <jk@well.ox.ac.uk>
@petrelharp
Copy link
Contributor Author

I did not put the warning in for divmat because it was previously undocumented.

@benjeffery
Copy link
Member

benjeffery commented Nov 11, 2025

Hi @petrelharp, I'm trying to get my head around this. tsk_treeseq_divergence_matrix_site is renamed to tsk_treeseq_divergence_matrix_mutation but there are no other changes to it? Does this mean that it was already doing the right thing for mutation mode?

@gregorgorjanc
Copy link
Member

@benjeffery yes, that is my understanding.

@benjeffery
Copy link
Member

I also can't see any tests (I might have missed them) that aren't discrete_genome=False and actually have multiple mutations per site.

@benjeffery
Copy link
Member

@benjeffery yes, that is my understanding.

Sorry, re-read Peter's initial message and he mentions it there.

@petrelharp
Copy link
Contributor Author

See updated explanation above.

@petrelharp
Copy link
Contributor Author

Why are we updating all the code for divergence matrix, if it's not mentioned in the changelog?

It is mentioned in the CHANGELOG: the changelog says - Add ``TreeSequence.divergence_matrix``, which was previously undocumented.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! That all makes sense, thanks.

I've logged #3320 to track documenting threading parameters.

@benjeffery
Copy link
Member

I'm still wondering if we need tests with multiple mutations per site here?

@jeromekelleher
Copy link
Member

Easy enough to add?

@petrelharp
Copy link
Contributor Author

petrelharp commented Nov 12, 2025

I'm still wondering if we need tests with multiple mutations per site here?

Ah right, missed that query, sorry. We do already - this is not new functionality, and when @jeromekelleher implemented this he did have those tests, for instance here.

@petrelharp petrelharp added this pull request to the merge queue Nov 12, 2025
@petrelharp petrelharp removed this pull request from the merge queue due to a manual request Nov 12, 2025
@benjeffery
Copy link
Member

Sorry @petrelharp, I'm still confused. In #3299 you say that the existing site mode is only a valid mutation mode for a tree sequence where there is only one mutation per site (i.e. infinite sites model). If we are are renaming the existing site mode to be a true mutation mode, don't we need to make it a valid mutation mode for all tree sequences?

@petrelharp
Copy link
Contributor Author

In #3299 you say that the existing site mode is only a valid mutation mode for a tree sequence where there is only one mutation per site (i.e. infinite sites model). If we are are renaming the existing site mode to be a true mutation mode, don't we need to make it a valid mutation mode for all tree sequences?

Hm, that's certainly not what I meant to say. What I meant to say is that there are two similar but distinct modes, "site" and "mutation", that we would sometimes previously confuse for each other. As I've written in this PR:

  • branch counts each bit of branch length
  • mutation counts each mutation (so, measuring branch length by number of mutations)
  • site counts whether alleles differ at each site

We might phrase it a different way: suppose that we have a function f(A) that maps from subsets A of the samples to the real numbers. We could use this in an additive way to summarize patterns in a tree sequence as follows:

  • branch: apply f( ) to the samples below each branch, weighted by branch length
  • mutation: apply f( ) to the samples below each mutation
  • site: apply f( ) to the sets of samples that carry each allele

The thing about infinite sites is that if no site has more than one mutation, then site and mutation are the same; they only differ if there's back mutation or multiple mutations on the same site.

We would ideally implement mutation mode for all (most?) stats, but that's TODO. We aren't planning to implement site mode for the two stats I deal with here, because site mode (since it's to do with allelic state, which may not agree with the trees) doesn't play well with the underlying algorithm.

Does this answer your question?

@benjeffery
Copy link
Member

benjeffery commented Nov 14, 2025

Does this answer your question?

Yes, I get the distinction between the different modes, and how infinite sites makes site mode the same as mutation mode. What I don't understand is how the current implementation of tsk_treeseq_divergence_matrix_mutation in this PR does mutation mode.

In group_alleles at https://github.com/tskit-dev/tskit/blob/main/c/tskit/trees.c#L8718 the grouping is by allelic state, not mutation events, then the divergence matrix is incremented once per pair of different alleles, not considering mutations.

To explain differently, the only place tsk_treeseq_divergence_matrix_mutation uses any mutation data is in the decoding of genotypes, so I don't see how it can be doing "mutation counts each mutation (so, measuring branch length by number of mutations)" as you say above.

Sorry to keep on!

@petrelharp
Copy link
Contributor Author

Ah! Sorry, let me investigate.

@petrelharp
Copy link
Contributor Author

I'll investigate in the code, but according to our docstring, we're doing mutation mode:
Screenshot From 2025-11-14 16-28-18

@petrelharp
Copy link
Contributor Author

Dang, good catch - looks like you're right, @benjeffery. I'm not sure how that happened (but suspect it was my fault). I'll just verify (probably with a test) that this is what's going on, and if so, put up a new PR with just a corrected docstring (and documentation for divergence_matrix).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants