Basic optimizations for debug builds. by tmiw · Pull Request #6 · peterbmarks/radae_nopy

tmiw · 2026-03-15T23:06:00Z

Basic optimizations in DSP and acquisition code. This seems to reduce the share of CPU time that perf attributes to rade_acq_check_pilots and rade_acq_detect_pilots by around 25%. Tested via:

$ perf record -g ./src/rade_demod_wav tx.wav rx.wav

and then viewing the generated report via perf report (ensuring we zoom into librade.so).

tmiw · 2026-03-15T23:07:40Z

BTW I suspect a lot of the slowness is because of how the rade_dsp functions work (i.e. they're not actually inlined unless optimization is turned on, and as a result we have overhead trying to return a non-trivial object back to the caller). perf still shows a lot of usage by rade_cadd and rade_cmul after this PR :(
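
As an illustration of the overhead described above (not the actual rade_dsp code), here is a minimal sketch of the two styles. The type `cplx_t` and the functions `sketch_cadd`, `sketch_cmul`, `dot_via_helpers` and `dot_direct` are hypothetical names invented for this example; the real rade_cadd and rade_cmul signatures may differ. The sketch computes a plain (unconjugated) complex dot product two ways: through small helpers that return a struct by value, and with the arithmetic written out in the loop.

```c
#include <stddef.h>

/* Hypothetical complex type; the real rade_dsp type may differ. */
typedef struct { float re; float im; } cplx_t;

/* Helper style: every call returns a struct by value. In an unoptimized
   build (e.g. -O0) compilers typically do not inline these, so each loop
   iteration pays for two calls plus copying the returned struct. */
static inline cplx_t sketch_cadd(cplx_t a, cplx_t b) {
    cplx_t r = { a.re + b.re, a.im + b.im };
    return r;
}

static inline cplx_t sketch_cmul(cplx_t a, cplx_t b) {
    cplx_t r = { a.re * b.re - a.im * b.im,
                 a.re * b.im + a.im * b.re };
    return r;
}

/* Dot product built from the helpers above. */
cplx_t dot_via_helpers(const cplx_t *x, const cplx_t *y, size_t n) {
    cplx_t acc = { 0.0f, 0.0f };
    for (size_t i = 0; i < n; i++)
        acc = sketch_cadd(acc, sketch_cmul(x[i], y[i]));
    return acc;
}

/* Same dot product with the arithmetic written out. The accumulators are
   plain floats and `restrict` tells the compiler the inputs do not alias,
   so there is no per-iteration call or struct copy even at -O0. */
cplx_t dot_direct(const cplx_t *restrict x, const cplx_t *restrict y, size_t n) {
    float re = 0.0f, im = 0.0f;
    for (size_t i = 0; i < n; i++) {
        re += x[i].re * y[i].re - x[i].im * y[i].im;
        im += x[i].re * y[i].im + x[i].im * y[i].re;
    }
    cplx_t r = { re, im };
    return r;
}
```

Both functions perform the same arithmetic in the same order; with optimization enabled a compiler will often generate very similar code for each, which is consistent with the gap being largest in debug builds.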

tmiw · 2026-03-15T23:09:47Z

Should probably ping @drowe67 and @peterbmarks too.

drowe67 · 2026-03-15T23:15:15Z

@peterbmarks - pls hold off on merge on this one.

@tmiw - I think we are slipping into "RADE V1 maintenance mode" which I think we agreed at PLT was not our current strategy? I'm not convinced these optimisations are necessary.

Before we go down this path, does simply building with Release get us the performance we need? If there is a justification for optimisation, then happy to discuss it. If not - then we all have a lot of other high priority work to do and should focus there.

tmiw · 2026-03-16T23:36:44Z

FWIW, here's a comparison between main and this PR when built in Release mode:

main

mooneer@fedora:~/radae_nopy/build$ time ./src/rade_modulate_wav ../voice.wav tx.wav
Input: ../voice.wav  44100 Hz  1 ch  16-bit int
Speech input: 186549 samples @ 16000 Hz  (11.7 s)
rade_open: model_file=model19_check3/checkpoints/checkpoint_epoch_100.pth (ignored, using built-in weights)
rade_open: n_features_in=432 Nmf=960 Neoo=1152 n_eoo_bits=180
Modem frames: 98 + EOO
Output: tx.wav  11.9 s  (190464 bytes)

real    0m0.122s
user    0m0.106s
sys     0m0.008s
mooneer@fedora:~/radae_nopy/build$ time ./src/rade_demod_wav tx.wav rx.wav
Input: tx.wav  8000 Hz  1 ch  16-bit int
Modem input: 95232 samples @ 8000 Hz  (11.9 s)
rade_open: model_file=model19_check3/checkpoints/checkpoint_epoch_100.pth (ignored, using built-in weights)
rade_open: n_features_in=432 Nmf=960 Neoo=1152 n_eoo_bits=180
End-of-over at modem frame 99
Modem frames: 100   valid: 95
Output: rx.wav  11.3 s  (363200 bytes)

real    0m0.928s
user    0m0.908s
sys     0m0.005s
mooneer@fedora:~/radae_nopy/build$

This PR

mooneer@fedora:~/radae_nopy/build$ time ./src/rade_modulate_wav ../voice.wav tx.wav
Input: ../voice.wav  44100 Hz  1 ch  16-bit int
Speech input: 186549 samples @ 16000 Hz  (11.7 s)
rade_open: model_file=model19_check3/checkpoints/checkpoint_epoch_100.pth (ignored, using built-in weights)
rade_open: n_features_in=432 Nmf=960 Neoo=1152 n_eoo_bits=180
Modem frames: 98 + EOO
Output: tx.wav  11.9 s  (190464 bytes)

real    0m0.119s
user    0m0.106s
sys     0m0.005s
mooneer@fedora:~/radae_nopy/build$ time ./src/rade_demod_wav tx.wav rx.wav
Input: tx.wav  8000 Hz  1 ch  16-bit int
Modem input: 95232 samples @ 8000 Hz  (11.9 s)
rade_open: model_file=model19_check3/checkpoints/checkpoint_epoch_100.pth (ignored, using built-in weights)
rade_open: n_features_in=432 Nmf=960 Neoo=1152 n_eoo_bits=180
End-of-over at modem frame 99
Modem frames: 100   valid: 95
Output: rx.wav  11.3 s  (363200 bytes)

real    0m0.763s
user    0m0.745s
sys     0m0.004s
mooneer@fedora:~/radae_nopy/build$

I'd say ~18% improvement in RX (0.928 s to 0.763 s real), but it's already pretty fast without this change, so we can defer review until later.
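
The RX figure can be checked from the `real` times and the 11.9 s modem input reported above. The following is a small sketch of that arithmetic; the function names are made up for this example, and the numbers are copied from the two rade_demod_wav runs.

```c
#include <stdio.h>

/* Percentage reduction in run time going from t_before to t_after. */
double percent_reduction(double t_before, double t_after) {
    return 100.0 * (t_before - t_after) / t_before;
}

/* Real-time factor: seconds of audio processed per second of wall time. */
double realtime_factor(double audio_seconds, double wall_seconds) {
    return audio_seconds / wall_seconds;
}

/* Print a summary for the rade_demod_wav runs quoted in this comment. */
void print_rx_summary(void) {
    const double audio_s = 11.9;   /* "Modem input: ... (11.9 s)" */
    const double main_s  = 0.928;  /* 'real' time on main         */
    const double pr_s    = 0.763;  /* 'real' time with this PR    */

    printf("RX time reduction: %.1f%%\n", percent_reduction(main_s, pr_s));
    printf("Real-time factor, main:    %.1fx\n", realtime_factor(audio_s, main_s));
    printf("Real-time factor, this PR: %.1fx\n", realtime_factor(audio_s, pr_s));
}
```

On these numbers the reduction is about 17.8%, and both builds decode more than ten times faster than real time on the test machine. The TX times (0.122 s versus 0.119 s) are too close to separate from run-to-run noise with a single measurement.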

peterbmarks · 2026-03-16T23:39:28Z

Impressive! If it passes the tests I think we should merge it.

But I agree optimisation should mostly come later.

Peter

drowe67 · 2026-03-17T01:32:49Z

Thanks @tmiw. Couple of thoughts:

1. I think your proposed changes are in the acquisition code? If so, should the test be run on a noise input rather than a valid RADE V1 signal? Otherwise you're just testing the first few frames.
2. How fast is the Python version? I'm trying to get a feel for what our CPU load targets are, rather than just optimising because we can. There are a lot of optimisations I can think of as well, but I'm not sure they're worth the effort: they risk errors creeping in, and they consume your time and my time in coding and reviewing. That is why current PLT policy is not to do this sort of work.
3. This code is hard to review by inspection, and I'm not sure how well it is covered by the current unit tests. A targeted unit test might be necessary to really verify it (but pls don't start that without further discussion).

#2 is the critical question I think.
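
One way to get the noise input suggested in the first point is to generate a noise-only WAV file in the format rade_demod_wav reported for tx.wav (8000 Hz, 1 channel, 16-bit int). The sketch below is an assumption about how that could be done, not part of the PR: `write_noise_wav` and `put_le` are hypothetical helpers, and the noise is uniform rather than Gaussian.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Write the low `bytes` bytes of v in little-endian order. */
static void put_le(FILE *f, uint32_t v, int bytes) {
    for (int i = 0; i < bytes; i++)
        fputc((int)((v >> (8 * i)) & 0xFFu), f);
}

/* Write `seconds` of uniform white noise as a mono 16-bit PCM WAV file.
   `peak` is the maximum absolute sample value and is clamped to 16383 so
   that 2*peak+1 never exceeds the minimum guaranteed RAND_MAX (32767).
   Returns the number of samples written, or -1 on error. */
long write_noise_wav(const char *path, uint32_t sample_rate,
                     double seconds, unsigned seed, int peak) {
    if (seconds < 0.0 || peak < 0) return -1;
    if (peak > 16383) peak = 16383;

    FILE *f = fopen(path, "wb");
    if (!f) return -1;

    uint32_t n = (uint32_t)(sample_rate * seconds);
    uint32_t data_bytes = n * 2u;             /* 2 bytes per sample */

    fwrite("RIFF", 1, 4, f); put_le(f, 36u + data_bytes, 4);
    fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); put_le(f, 16, 4); /* fmt chunk size    */
    put_le(f, 1, 2);                           /* format: PCM       */
    put_le(f, 1, 2);                           /* channels: mono    */
    put_le(f, sample_rate, 4);
    put_le(f, sample_rate * 2u, 4);            /* byte rate         */
    put_le(f, 2, 2);                           /* block align       */
    put_le(f, 16, 2);                          /* bits per sample   */
    fwrite("data", 1, 4, f); put_le(f, data_bytes, 4);

    srand(seed);                               /* repeatable noise  */
    for (uint32_t i = 0; i < n; i++) {
        int s = (rand() % (2 * peak + 1)) - peak;  /* [-peak, peak] */
        put_le(f, (uint16_t)(int16_t)s, 2);
    }

    int ok = (ferror(f) == 0);
    if (fclose(f) != 0) ok = 0;
    return ok ? (long)n : -1;
}
```

For example, `write_noise_wav("noise.wav", 8000, 60.0, 1, 8000)` would produce a minute of noise that could be passed to rade_demod_wav in place of tx.wav, under the same perf record command as before, so the demodulator spends the whole run in acquisition instead of syncing within the first few frames.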

drowe67 · 2026-03-17T01:41:49Z

@tmiw - as the work in this PR has not been signed off pls ensure the main branch of this repo is used for any distribution of the nopy library, e.g. freedv-gui, Flex etc. As per our PLT decision a few days ago (and at several other times) we don't want unreviewed RADE V1 code being distributed.

peterbmarks · 2026-03-17T02:18:59Z

I have a user who is (barely) able to run on a Pi 3 with 1GB of RAM. Leighton wrote just now:

Hi Peter,

I just finished compiling the latest code on an old stock standard Raspberry Pi 3 with 1G memory. I started the build with the latest clean headless image (trixie).

I wasn't able to get it running with ALSA as I had wanted (thinking that this might be lighter?) so I had to install pulseaudio and compile with the pulse audio libraries. The best way I can describe it is ALSA resulted in broken audio packets being transmitted - I didn't test on receive before moving to pulse.

Running in receive, without any sync, top shows the program CPU usage at 105%. In transmit mode, the program CPU usage drops to around 60%. Even with the CPU over 100% the Pi 3 was still responding promptly between receive and transmit.

I just did a quick test on-air with Joe. With RADE sync, the CPU was still up around 105%. There was some audio underrun with the occasional "pop" throughout reception and the decoder spent a few seconds catching up on buffered receive audio. Even with these limitations, I was able to clearly understand Joe (SNR was reporting 22 dB my end).

Anyway, I just thought that it would be worth letting you know as this seems promising for running on anything with a little more processing power (or even more functional on a Pi 3 with some optimisation?).

When I get a chance, I will see if there is anything that I can do to squeeze some more out of the Pi.

Regards,
Leighton

drowe67 · 2026-03-17T02:38:49Z

That's an interesting data point Peter. Jean-Marc has told me the FARGAN Vocoder (which dominates theoretical CPU by a factor of 10) should run on a Pi 3 as a minimum. If Leighton wishes to see more optimisation work done by the team pls encourage him to submit a feature request form.

peterbmarks · 2026-03-17T02:50:01Z

Couple of things..

A Pi 4 has 4 Cores so I think utilisation goes up to 400%. He's reporting 105% which doesn't seem too bad.
Having said that, he's noting audio under-run and asks a good question about whether different audio drivers are more efficient on Linux (Raspbian).
Mooneer's changes are also in rade_dsp which might explain the impressive decode and encode increase in performance. @tmiw Was there another change, perhaps compiler optimisation that would explain this?
@drowe67 what's the procedure for code review? I don't have the skills. Are you the one?

Peter

drowe67 · 2026-03-17T02:55:55Z

> @drowe67 what's the procedure for code review? I don't have the skills. Are you the one?

Yes, as agreed at our last PLT I will be signing off on this code before release.

drowe67 · 2026-03-17T02:56:34Z

Oops, sorry, pressed the wrong button 😃

tmiw · 2026-03-17T16:33:22Z

> That's an interesting data point Peter. Jean-Marc has told me the FARGAN Vocoder (which dominates theoretical CPU by a factor of 10) should run on a Pi 3 as a minimum. If Leighton wishes to see more optimisation work done by the team pls encourage him to submit a feature request form.

FR created: drowe67/freedv-gui#1254

> Couple of things..
>
> A Pi 4 has 4 Cores so I think utilisation goes up to 400%. He's reporting 105% which doesn't seem too bad.
> Having said that, he's noting audio under-run and asks a good question about whether different audio drivers are more efficient on Linux (Raspbian).

The recommended audio engine these days is PipeWire, with PulseAudio being the next best if PipeWire isn't possible for whatever reason.

> Mooneer's changes are also in rade_dsp which might explain the impressive decode and encode increase in performance. @tmiw Was there another change, perhaps compiler optimisation that would explain this?

The changes do remove some recursion that was added during initial debugging, which might make it easier for the compiler to optimize.


Basic optimizations for debug builds.

1a69b96

drowe67 closed this Mar 17, 2026

drowe67 reopened this Mar 17, 2026

drowe67 mentioned this pull request Mar 18, 2026

[Feature Request] Optimisation to run on Raspberry Pi drowe67/freedv-gui#1254

Open

tmiw added 7 commits March 19, 2026 17:17

Bypass rade_dsp for dot product and give the compiler some additional help.

a2a67ae

Fix typos.

471106d

Eliminate compiler warnings from Opus.

db4508d

Fix additional compiler warning.

9b4c3ee

Implement Opus redirect for universal builds as well.

6957300

Add missed file.

decd79b

Fix universal build include dirs.

16cfa95
