Skip to content

Conversation

@bredelings
Copy link
Contributor

Allow multithreading for derivative operations. This PR is copying some changes from ji-group/action-dev.

@bredelings bredelings marked this pull request as draft August 19, 2025 15:27
@bredelings bredelings force-pushed the multi-threaded-derivative branch 4 times, most recently from b9aed21 to f34e746 Compare August 19, 2025 16:04
@bredelings
Copy link
Contributor Author

I made the CI checks for compilation run even if the branch isn't master. Let me know if I should remove this.

@bredelings bredelings force-pushed the multi-threaded-derivative branch from f34e746 to 5a363ac Compare August 19, 2025 16:10
@bredelings
Copy link
Contributor Author

bredelings commented Aug 19, 2025

First test:

@bredelings
Copy link
Contributor Author

bredelings commented Aug 19, 2025

Test: with WNV_skyline_HMC_diagonal_only_rates.xml, it seems like there is a bug. With hmc-clock, we get:

# BEAST v10.5.0-beta4 Prerelease #6bd3e4b65
# Generated Tue Aug 19 12:49:23 EDT 2025 [seed=1]
# -seed 1 -overwrite WNV_skyline_HMC_diagonal_only_rates.xml
# keywords: skyline
state	Posterior   	Prior       	Likelihood  	rootHeight  	age(root)   	Rate        
0	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
100	-29379.4329 	-3401.6519  	-25977.7809 	8.92510     	1998.70     	5.69209E-4  	-
200	-29367.3947 	-3406.0099  	-25961.3848 	8.92510     	1998.70     	5.63616E-4  	-
300	-29387.6408 	-3408.6317  	-25979.0091 	8.92510     	1998.70     	6.16E-4     	-
400	-29368.3503 	-3386.4367  	-25981.9135 	8.92510     	1998.70     	6.0875E-4   	-
500	-29369.9142 	-3391.6816  	-25978.2325 	8.92510     	1998.70     	5.98373E-4  	-
600	-29371.7778 	-3402.6489  	-25969.1289 	8.92510     	1998.70     	6.0299E-4   	-
700	-29377.6585 	-3406.8020  	-25970.8564 	8.92510     	1998.70     	5.94365E-4  	-
800	-29370.1016 	-3405.9741  	-25964.1275 	8.92510     	1998.70     	6.11914E-4  	-
900	-29388.7663 	-3410.7179  	-25978.0484 	8.92510     	1998.70     	6.00581E-4  	-
1000	-29371.4701 	-3397.6304  	-25973.8397 	8.92510     	1998.70     	5.93565E-4  	-

Operator analysis
Operator                                          Tuning   Count      Time     Time/Op  Pr(accept) Smoothed_Pr(accept)
VanillaHMC(branchRates.rates)                     0.233   990        44       0.04     0.7919      0.86        

But with this branch, we get:

# BEAST v10.5.0-beta4 Prerelease #6bd3e4b65
# Generated Tue Aug 19 12:50:50 EDT 2025 [seed=1]
# -seed 1 -overwrite WNV_skyline_HMC_diagonal_only_rates.xml
# keywords: skyline
state	Posterior   	Prior       	Likelihood  	rootHeight  	age(root)   	Rate        
0	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
100	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
200	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
300	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
400	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
500	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
600	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
700	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
800	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
900	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-
1000	-29674.0363 	-3468.5147  	-26205.5216 	8.92510     	1998.70     	1E-3        	-

Operator analysis
Operator                                          Tuning   Count      Time     Time/Op  Pr(accept) Smoothed_Pr(accept)
VanillaHMC(branchRates.rates)                     0.0     990        0        0.0      0.0         0.0         

Admittedly it does take less time to get the wrong answer (11s -> 5s).

@bredelings
Copy link
Contributor Author

Hmm... I am still getting bad results for WNV_skyline_HMC_diagonal_only_rates.xml. The acceptance rate is still 0.

@bredelings
Copy link
Contributor Author

OK, so now it works with -beagle_SSE_off. However, with SSE it SEGFAULTs. The reason seems to be that in void BeagleCPU4StateSSEImpl<BEAGLE_CPU_4_SSE_DOUBLE>::accumulateDerivativesImpl(..) the pointer grandNumeratorDerivTmp + k + offset is not 16-byte-aligned with offset is odd:

    for (; k < kPatternCount - 1; k += 2) {

        V_Real numerator = VEC_LOAD(grandNumeratorDerivTmp + k + offset);
        V_Real denominator = VEC_LOAD(grandDenominatorDerivTmp + k + offset);
        V_Real derivative = VEC_DIV(numerator, denominator);
        V_Real patternWeight = VEC_LOAD(gPatternWeights + k);

        if (DoDerivatives) {
            VEC_STOREU(outDerivatives + k, derivative);
        }

Here grandNumeratorDerivTmp is 16-byte-aligned and k is even. But offset can be odd. Since VEC_LOAD expands to _mm_load_pd I think this requires 16-byte alignment.

Hmm... its also possible that gPatternWeights + k should also be gPatternWeights + k + offset, and the same for outDerivative.

It looks like there is now an instruction called _mm_loadu_pd that allows unaligned accesses. The unaligned version was slower on older CPUs, but on modern CPUs the difference might be much lower.

@bredelings
Copy link
Contributor Author

So now it works on my computer and gives the same results as on the hmc-clock branch. One weird thing is that the run time (measured using the time command) is like 8.84s for this branch, but 9.15s for hmc-clock. However, this branch is using 1100% CPU, whereas on hmc-clock it's using about 240% CPU.

@bredelings
Copy link
Contributor Author

I also see that the aarch64 build is failing now...

@bredelings
Copy link
Contributor Author

Seems like updating the image from ubuntu-20.04 to ubuntu-22.04 may have fixed the aarch64 build.

@bredelings
Copy link
Contributor Author

bredelings commented Aug 22, 2025

I tried to see why the multithreaded run isn't faster, and I found that with 6 threads it takes 5.8 seconds, but it defaults to 24 threads which takes 8.8 seconds. So maybe the changes that lower the magic multi-threading limit should be reverted. Or even raised (although I didn't do any benchmarks here).

The weird thing is that all 24 threads appear to be continuously busy. In other cases, the program could be slow because one thread takes a long time, but other threads finish early and are doing nothing. However, here it seems like creating more threads also creates more work. Could some of the threads be doing duplicate work?

@xji3 xji3 marked this pull request as ready for review August 24, 2025 19:45
@xji3 xji3 merged commit edfb106 into beagle-dev:hmc-clock Aug 24, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants