Multi-threaded derivative #230

bredelings · 2025-08-19T15:25:57Z

Allow multithreading for derivative operations. This PR copies some changes from ji-group/action-dev.

libhmsbeagle/CPU/BeagleCPUImpl.hpp

bredelings · 2025-08-19T16:07:46Z

I made the CI checks for compilation run even if the branch isn't master. Let me know if I should remove this.

bredelings · 2025-08-19T16:41:16Z

First test:

markov-modulated model + HMC gives the same result as the hmc-clock branch (using file https://github.com/ji-group/ActionCPUManuscript/blob/main/xmls/grad_yeast_codon_hmc_K4.xml).
Speedup 18.4s -> 4.7s with 12 cores (3.9x speedup)

bredelings · 2025-08-19T16:51:45Z

Test: with WNV_skyline_HMC_diagonal_only_rates.xml, it seems like there is a bug. With hmc-clock, we get:

# BEAST v10.5.0-beta4 Prerelease #6bd3e4b65
# Generated Tue Aug 19 12:49:23 EDT 2025 [seed=1]
# -seed 1 -overwrite WNV_skyline_HMC_diagonal_only_rates.xml
# keywords: skyline
state	Posterior   	Prior       	Likelihood  	rootHeight  	age(root)   	Rate        
0	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
100	-29379.4329 	-3401.6519  	-25977.7809 	8.92510     	1998.70     	5.69209E-4  	-
200	-29367.3947 	-3406.0099  	-25961.3848 	8.92510     	1998.70     	5.63616E-4  	-
300	-29387.6408 	-3408.6317  	-25979.0091 	8.92510     	1998.70     	6.16E-4     	-
400	-29368.3503 	-3386.4367  	-25981.9135 	8.92510     	1998.70     	6.0875E-4   	-
500	-29369.9142 	-3391.6816  	-25978.2325 	8.92510     	1998.70     	5.98373E-4  	-
600	-29371.7778 	-3402.6489  	-25969.1289 	8.92510     	1998.70     	6.0299E-4   	-
700	-29377.6585 	-3406.8020  	-25970.8564 	8.92510     	1998.70     	5.94365E-4  	-
800	-29370.1016 	-3405.9741  	-25964.1275 	8.92510     	1998.70     	6.11914E-4  	-
900	-29388.7663 	-3410.7179  	-25978.0484 	8.92510     	1998.70     	6.00581E-4  	-
1000	-29371.4701 	-3397.6304  	-25973.8397 	8.92510     	1998.70     	5.93565E-4  	-

Operator analysis
Operator                                          Tuning   Count      Time     Time/Op  Pr(accept) Smoothed_Pr(accept)
VanillaHMC(branchRates.rates)                     0.233   990        44       0.04     0.7919      0.86

But with this branch, we get:

# BEAST v10.5.0-beta4 Prerelease #6bd3e4b65
# Generated Tue Aug 19 12:50:50 EDT 2025 [seed=1]
# -seed 1 -overwrite WNV_skyline_HMC_diagonal_only_rates.xml
# keywords: skyline
state	Posterior   	Prior       	Likelihood  	rootHeight  	age(root)   	Rate        
0	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
100	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
200	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
300	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
400	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
500	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
600	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
700	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
800	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
900	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
1000	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-

Operator analysis
Operator                                          Tuning   Count      Time     Time/Op  Pr(accept) Smoothed_Pr(accept)
VanillaHMC(branchRates.rates)                     0.0     990        0        0.0      0.0         0.0

Admittedly it does take less time to get the wrong answer (11s -> 5s).

examples/hmctest/hmctest.cpp

bredelings · 2025-08-20T01:18:24Z

Hmm... I am still getting bad results for WNV_skyline_HMC_diagonal_only_rates.xml. The acceptance rate is still 0.

bredelings · 2025-08-20T21:55:54Z

OK, so now it works with -beagle_SSE_off. However, with SSE it segfaults. The reason seems to be that in void BeagleCPU4StateSSEImpl<BEAGLE_CPU_4_SSE_DOUBLE>::accumulateDerivativesImpl(..) the pointer grandNumeratorDerivTmp + k + offset is not 16-byte-aligned when offset is odd:

    for (; k < kPatternCount - 1; k += 2) {

        V_Real numerator = VEC_LOAD(grandNumeratorDerivTmp + k + offset);
        V_Real denominator = VEC_LOAD(grandDenominatorDerivTmp + k + offset);
        V_Real derivative = VEC_DIV(numerator, denominator);
        V_Real patternWeight = VEC_LOAD(gPatternWeights + k);

        if (DoDerivatives) {
            VEC_STOREU(outDerivatives + k, derivative);
        }

Here grandNumeratorDerivTmp is 16-byte-aligned and k is even. But offset can be odd. Since VEC_LOAD expands to _mm_load_pd I think this requires 16-byte alignment.
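The alignment arithmetic can be checked without any SSE code. The following is a minimal sketch, not BEAGLE code: isAligned16 and alignedLoadIsSafe are hypothetical helper names, and the only assumption is that a double is 8 bytes, so each odd step from a 16-byte-aligned base lands 8 bytes past a boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p sits on a 16-byte boundary, which is what _mm_load_pd requires.
inline bool isAligned16(const double* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: given a 16-byte-aligned base and an even k (as in the
// SSE loop above), report whether base + k + offset is still safe for an
// aligned load. With 8-byte doubles this holds exactly when offset is even.
inline bool alignedLoadIsSafe(const double* base, std::size_t k, std::size_t offset) {
    assert(isAligned16(base) && k % 2 == 0);
    return isAligned16(base + k + offset);
}
```

Calling alignedLoadIsSafe(buf, 0, 1) on an alignas(16) buffer returns false, which matches the crash seen with odd offsets.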

Hmm... it's also possible that gPatternWeights + k should be gPatternWeights + k + offset, and the same for outDerivatives.

There is also an intrinsic called _mm_loadu_pd that allows unaligned loads. The unaligned version was slower on older CPUs, but on modern CPUs the difference might be much smaller.
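As an illustration of the unaligned option, here is a hedged sketch of the numerator/denominator division using _mm_loadu_pd and _mm_storeu_pd. It is not the change made in this PR: divideUnaligned and the HAVE_SSE2 macro are hypothetical names, and the code falls back to a scalar loop when SSE2 is not available at compile time.

```cpp
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1   // hypothetical macro, only for this sketch
#endif

// Divide numerator[i] by denominator[i] for i in [0, n), two doubles at a time.
// Unaligned loads and stores are used, so the pointers may start at any double,
// including an odd offset into a 16-byte-aligned buffer.
inline void divideUnaligned(const double* numerator, const double* denominator,
                            double* out, std::size_t n) {
    std::size_t k = 0;
#ifdef HAVE_SSE2
    for (; k + 1 < n; k += 2) {
        __m128d num = _mm_loadu_pd(numerator + k);
        __m128d den = _mm_loadu_pd(denominator + k);
        _mm_storeu_pd(out + k, _mm_div_pd(num, den));
    }
#endif
    for (; k < n; ++k) {   // scalar tail, or the whole loop without SSE2
        out[k] = numerator[k] / denominator[k];
    }
}
```

The trade-off is the one described above: no padding or offset bookkeeping is needed, at the cost of whatever penalty unaligned accesses carry on the target CPU.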

bredelings · 2025-08-21T01:54:34Z

So now it works on my computer and gives the same results as on the hmc-clock branch. One weird thing is that the run time (measured using the time command) is 8.84s for this branch versus 9.15s for hmc-clock, even though this branch is using 1100% CPU, whereas hmc-clock is using about 240% CPU.

bredelings · 2025-08-21T01:54:48Z

I also see that the aarch64 build is failing now...

bredelings · 2025-08-21T02:13:39Z

Seems like updating the image from ubuntu-20.04 to ubuntu-22.04 may have fixed the aarch64 build.

libhmsbeagle/CPU/BeagleCPU4StateSSEImpl.hpp

bredelings · 2025-08-22T13:30:08Z

I tried to see why the multithreaded run isn't faster, and I found that with 6 threads it takes 5.8 seconds, but it defaults to 24 threads, which takes 8.8 seconds. So maybe the changes that lower the magic multi-threading limit should be reverted, or the limit even raised (although I didn't do any benchmarks here).

The weird thing is that all 24 threads appear to be continuously busy. In other cases, the program could be slow because one thread takes a long time, but other threads finish early and are doing nothing. However, here it seems like creating more threads also creates more work. Could some of the threads be doing duplicate work?
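One way to think about the "magic" limit is as a minimum number of patterns per thread that caps the thread count. The sketch below is only an illustration of that idea: chooseThreadCount and its parameters are hypothetical and are not BEAGLE's actual heuristic or constants.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: choose how many threads to use so that each thread
// gets at least minPatternsPerThread patterns, capped at maxThreads.
// Always returns at least 1.
inline std::size_t chooseThreadCount(std::size_t patternCount,
                                     std::size_t minPatternsPerThread,
                                     std::size_t maxThreads) {
    if (minPatternsPerThread == 0 || maxThreads == 0) return 1;
    std::size_t byWork = patternCount / minPatternsPerThread;
    return std::max<std::size_t>(1, std::min(byWork, maxThreads));
}
```

With this kind of rule, lowering the per-thread minimum lets small data sets fan out to all 24 threads, while raising it keeps them on fewer threads, as in the 6-thread run above.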

bredelings marked this pull request as draft August 19, 2025 15:27

bredelings force-pushed the multi-threaded-derivative branch 4 times, most recently from b9aed21 to f34e746 Compare August 19, 2025 16:04

xji3 added 3 commits August 19, 2025 12:09

make edge derivative reduction multi-threaded

37f7c94

make action edge derivative reduction multi-threaded

1b1c605

Check compilation also on branches that are not master.

5a363ac

bredelings force-pushed the multi-threaded-derivative branch from f34e746 to 5a363ac Compare August 19, 2025 16:10

xji3 added 3 commits August 19, 2025 14:35

hmctest multi-threading

23cd0a1

fix edge reduction multi-threading

f7a7f02

lower magic multi-threading pattern per thread bounds

cde482a

xji3 added 3 commits August 20, 2025 14:16

4state edge reduction multi-thread respects offset

3bb484c

change hmctest to 2 rate categories

4e03a1a

let edge deriv reduction in 4StateSSE impl respect offsets

01c91d6

xji3 added 2 commits August 20, 2025 20:24

padding offset would fix the issue if we pad patternCount for SSE, which however we do not

160d862

patch: padding edge derivative multi-threading cache offset to even numbers to avoid segfault on linux

e856a39

Try ubuntu-22.04 for ARM

62e3a65

xji3 added 2 commits August 24, 2025 14:25

revert magic numbers

6098928

Merge branch 'multi-threaded-derivative' of https://github.com/ji-group/beagle-lib into multi-threaded-derivative

cb73b92

xji3 marked this pull request as ready for review August 24, 2025 19:45

xji3 merged commit edfb106 into beagle-dev:hmc-clock Aug 24, 2025
2 checks passed
