
Add OpenMP Parallel Implementation for Belief Propagation Decoder#78

Open
0xSooki wants to merge 5 commits into quantumgizmos:main from 0xSooki:add-openmp-for-bp

Conversation

0xSooki commented Jun 5, 2025

This PR introduces OpenMP-based parallelization to the Belief Propagation (BP) decoder, significantly improving performance for large LDPC codes. Closes #72

Summary

  • Added OpenMP support for parallel execution of BP iterations
  • Added Cython bindings for the underlying parallel C++ implementation

Testing

  • Added performance benchmarks across 1, 2, 4, and 8 threads; all parallel results verified against the serial implementation
  • Performance validated on 1200×600 LDPC matrices

Performance

Based on the `TEST(BpDecoderParallel, ThreadScaling)` benchmark:

| Threads | Parallel Time (μs) | Serial Time (μs) | Speedup vs Serial |
|--------:|-------------------:|-----------------:|------------------:|
| 1       | 13,445             | 24,284           | 1.81x             |
| 2       | 9,201              | 24,239           | 2.63x             |
| 4       | 6,953              | 24,310           | 3.50x             |
| 8       | 10,407             | 24,591           | 2.36x             |

@quantumgizmos (Owner)

Nice work. Do you have any intuition as to why there is a slowdown from 4 → 8 threads?

  • Could it be that the extra cores on your device are efficiency cores or similar?

  • Perhaps further speedups could be obtained by investigating the shared vs private OpenMP variables across cores?

0xSooki (Author) commented Jun 5, 2025

My intuition is that with more threads the overhead of managing them grows, and the cache is used less efficiently. Something similar occurred while I was writing my thesis, so in the end I used only 5 threads to get the maximum speed benefit from parallelization. I will look into further speedup improvements.

@quantumgizmos (Owner)

Is the shared memory for each thread being recopied at each iteration I wonder?
I would have thought that the parallelisation overhead should be quite small once the thread pool (and associated memory for each thread) is initialised.

However, if the entire PCM is being copied to each thread at each iteration, I could see that this might cause quite some overhead.

0xSooki (Author) commented Jun 5, 2025

I believe that class members should be shared by default (their values are not copied per thread). However, I will try benchmarking with the explicit `shared` keyword.

@quantumgizmos (Owner)

Another thing to try would be benchmarking on larger LDPC codes. In quantum error correction, we often decode codes over matrices of 10,000+ columns. It's possible that the parallelisation overhead could be less of a bottleneck in this regime.

0xSooki (Author) commented Jun 5, 2025

Ahh yes, with 15,000 columns I was able to achieve the following results. This time, 8 threads did much better.

| Threads | Parallel Time (μs) | Serial Time (μs) | Speedup vs Serial |
|--------:|-------------------:|-----------------:|------------------:|
| 1       | 167,997            | 324,516          | 1.93x             |
| 2       | 95,153             | 336,669          | 3.54x             |
| 4       | 109,404            | 335,716          | 3.07x             |
| 8       | 78,741             | 324,567          | 4.12x             |

quantumgizmos (Owner) commented Jun 5, 2025

This is great to see. I noticed your benchmark is running over a single random syndrome. Some syndromes are more difficult to decode than others, so this could be accounting for the increased speed at thread_count=4. Could you average over a larger number of cycles (e.g. 1000 or so)? If necessary, to speed things up, you could increase the sparsity of the matrix and syndrome.

0xSooki (Author) commented Jun 5, 2025

I have added a cycles parameter for the benchmarks and ran one with a matrix of 10,000 columns over 1,000 cycles.

| Threads | Parallel avg (μs) | Serial avg (μs) | Speedup |
|--------:|------------------:|----------------:|--------:|
| 1       | 91,745            | 184,536         | 2.01x   |
| 2       | 60,514            | 184,029         | 3.04x   |
| 4       | 42,364            | 184,181         | 4.35x   |
| 8       | 48,088            | 185,780         | 3.86x   |

0xSooki (Author) commented Jun 10, 2025

I have tried some improvements, such as using locks, OpenMP atomics, or storing partial results in a matrix and reducing it afterwards, thus eliminating the critical section. I managed to get a small speedup at higher thread counts by using atomics, for a run with the same specs as #78 (comment): it reduced the time from ~48,000 μs to ~42,000 μs for 8 threads.

0xSooki (Author) commented Jun 16, 2025

@quantumgizmos, just a gentle reminder that today was mentioned as the last day for reviewing open PRs. If it would be helpful to keep iterating, I’d be more than happy to continue working on it.

@quantumgizmos (Owner)

Hi @0xSooki Can you make a comment on #72 ? I can assign you to that Issue and the UnitaryHack team will be in touch about the bounty reward.

@quantumgizmos (Owner)

If you are interested in working on this beyond UnitaryHack, we can explore ways of further improving the OpenMP implementation. Let me know :)

0xSooki (Author) commented Jun 22, 2025

I would be more than happy to keep working on it. What would be your preferred way of communication?

@quantumgizmos (Owner)

@0xSooki Great. My email address is joschka@roffe.eu
