Basic optimizations for debug builds. #6

Open
tmiw wants to merge 8 commits into main from ms-optim-debug

@tmiw
Collaborator

@tmiw tmiw commented Mar 15, 2026

Basic optimizations in the DSP and acquisition code. They seem to reduce the percentage of samples that perf attributes to rade_acq_check_pilots and rade_acq_detect_pilots by around 25%. Tested via:

$ perf record -g ./src/rade_demod_wav tx.wav rx.wav

and then viewing the generated report via perf report (ensuring we zoom into librade.so).

@tmiw
Collaborator Author

tmiw commented Mar 15, 2026

BTW I suspect a lot of the slowness is because of how the rade_dsp functions work (i.e. they're not actually inlined unless optimization is turned on, and as a result we have overhead trying to return a non-trivial object back to the caller). perf still shows a lot of usage by rade_cadd and rade_cmul after this PR :(
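
To illustrate the point about by-value helpers, here is a minimal sketch in C. The type `cplx_t` and the functions `cplx_add`, `cplx_mul`, `dot_by_value` and `dot_expanded` are hypothetical names invented for this example; the real rade_dsp definitions are not shown in this thread and may differ. The first loop goes through small helpers that return a struct, which a compiler such as GCC normally leaves as real calls when optimization is off. The second loop writes the same arithmetic out by hand, so no helper calls remain to be inlined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type; not the actual rade_dsp definition. */
typedef struct { float re; float im; } cplx_t;

/* By-value helpers. Without optimization these are usually emitted as
   ordinary function calls, each copying a struct result to the caller. */
static inline cplx_t cplx_add(cplx_t a, cplx_t b) {
    cplx_t r = { a.re + b.re, a.im + b.im };
    return r;
}

static inline cplx_t cplx_mul(cplx_t a, cplx_t b) {
    cplx_t r = { a.re * b.re - a.im * b.im,
                 a.re * b.im + a.im * b.re };
    return r;
}

/* Sum of products using the helpers: two calls per sample. */
cplx_t dot_by_value(const cplx_t *x, const cplx_t *y, size_t n) {
    assert(x != NULL && y != NULL);
    cplx_t acc = { 0.0f, 0.0f };
    for (size_t i = 0; i < n; i++)
        acc = cplx_add(acc, cplx_mul(x[i], y[i]));
    return acc;
}

/* Same sum with the arithmetic expanded in the loop: no helper calls,
   so the result does not depend on the compiler inlining anything. */
cplx_t dot_expanded(const cplx_t *x, const cplx_t *y, size_t n) {
    assert(x != NULL && y != NULL);
    float re = 0.0f, im = 0.0f;
    for (size_t i = 0; i < n; i++) {
        re += x[i].re * y[i].re - x[i].im * y[i].im;
        im += x[i].re * y[i].im + x[i].im * y[i].re;
    }
    cplx_t acc = { re, im };
    return acc;
}
```

Both functions compute the same value; the difference only matters for the call overhead seen in unoptimized builds, and it trades away some readability.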

@tmiw
Collaborator Author

tmiw commented Mar 15, 2026

Should probably ping @drowe67 and @peterbmarks too.

@drowe67
Collaborator

drowe67 commented Mar 15, 2026

@peterbmarks - pls hold off on merge on this one.

@tmiw - I think we are slipping into "RADE V1 maintenance mode" which I think we agreed at PLT was not our current strategy? I'm not convinced these optimisations are necessary.

Before we go down this path, does simply building with Release get us the performance we need? If there is a justification for optimisation, then happy to discuss it. If not - then we all have a lot of other high priority work to do and should focus there.

@tmiw
Collaborator Author

tmiw commented Mar 16, 2026

FWIW, here's a comparison between main and this PR when built in Release mode:

main

mooneer@fedora:~/radae_nopy/build$ time ./src/rade_modulate_wav ../voice.wav tx.wav
Input: ../voice.wav  44100 Hz  1 ch  16-bit int
Speech input: 186549 samples @ 16000 Hz  (11.7 s)
rade_open: model_file=model19_check3/checkpoints/checkpoint_epoch_100.pth (ignored, using built-in weights)
rade_open: n_features_in=432 Nmf=960 Neoo=1152 n_eoo_bits=180
Modem frames: 98 + EOO
Output: tx.wav  11.9 s  (190464 bytes)

real    0m0.122s
user    0m0.106s
sys     0m0.008s
mooneer@fedora:~/radae_nopy/build$ time ./src/rade_demod_wav tx.wav rx.wav
Input: tx.wav  8000 Hz  1 ch  16-bit int
Modem input: 95232 samples @ 8000 Hz  (11.9 s)
rade_open: model_file=model19_check3/checkpoints/checkpoint_epoch_100.pth (ignored, using built-in weights)
rade_open: n_features_in=432 Nmf=960 Neoo=1152 n_eoo_bits=180
End-of-over at modem frame 99
Modem frames: 100   valid: 95
Output: rx.wav  11.3 s  (363200 bytes)

real    0m0.928s
user    0m0.908s
sys     0m0.005s
mooneer@fedora:~/radae_nopy/build$

This PR

mooneer@fedora:~/radae_nopy/build$ time ./src/rade_modulate_wav ../voice.wav tx.wav
Input: ../voice.wav  44100 Hz  1 ch  16-bit int
Speech input: 186549 samples @ 16000 Hz  (11.7 s)
rade_open: model_file=model19_check3/checkpoints/checkpoint_epoch_100.pth (ignored, using built-in weights)
rade_open: n_features_in=432 Nmf=960 Neoo=1152 n_eoo_bits=180
Modem frames: 98 + EOO
Output: tx.wav  11.9 s  (190464 bytes)

real    0m0.119s
user    0m0.106s
sys     0m0.005s
mooneer@fedora:~/radae_nopy/build$ time ./src/rade_demod_wav tx.wav rx.wav
Input: tx.wav  8000 Hz  1 ch  16-bit int
Modem input: 95232 samples @ 8000 Hz  (11.9 s)
rade_open: model_file=model19_check3/checkpoints/checkpoint_epoch_100.pth (ignored, using built-in weights)
rade_open: n_features_in=432 Nmf=960 Neoo=1152 n_eoo_bits=180
End-of-over at modem frame 99
Modem frames: 100   valid: 95
Output: rx.wav  11.3 s  (363200 bytes)

real    0m0.763s
user    0m0.745s
sys     0m0.004s
mooneer@fedora:~/radae_nopy/build$

I'd say roughly an 18% improvement in RX (0.928 s down to 0.763 s real) with TX essentially unchanged, but it's already pretty fast without this change, so we can defer review until later.
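
For reference, the arithmetic behind the comparison can be written as a small C sketch. The function names `relative_reduction` and `realtime_factor` are made up for this example; the inputs are the `real` times and the 11.9 s signal length from the logs above.

```c
#include <assert.h>

/* Fractional reduction in run time when going from `before` to `after`
   seconds, e.g. 0.25 means the new run takes 25% less time. */
double relative_reduction(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before;
}

/* How many seconds of audio are processed per second of wall-clock time. */
double realtime_factor(double audio_seconds, double run_seconds) {
    assert(run_seconds > 0.0);
    return audio_seconds / run_seconds;
}
```

With the demodulator timings, `relative_reduction(0.928, 0.763)` is about 0.178, and `realtime_factor(11.9, 0.928)` and `realtime_factor(11.9, 0.763)` come to roughly 13 and 16 times real time on this machine. The modulator barely moves: `relative_reduction(0.122, 0.119)` is about 0.025, which is likely within run-to-run noise for a single measurement.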

@peterbmarks
Owner

Impressive! If it passes the tests I think we should merge it.

But I agree optimisation should mostly come later.

Peter

@drowe67
Collaborator

drowe67 commented Mar 17, 2026

Thanks @tmiw. Couple of thoughts:

  1. Actually I think your proposed changes are in the acquisition code? So should the test be run on a noise input, not a valid RADE V1 signal? Otherwise you're just testing the first few frames.
  2. How fast is the Python version? Trying to get a feel for what our CPU load targets are - rather than just optimising because we can. There are a lot of optimisations I can think of as well, but not sure it's worth the effort, and they mean risk of errors creeping in, plus your time and my time consumed coding and reviewing. Which is why current PLT policy is not to do this sort of work.
  3. This code is hard to review by inspection, and I'm not sure how well covered this is by the current unit tests. A targeted unit test might be necessary to really verify (but pls don't start that without further discussion).

#2 is the critical question I think.
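
One way to produce the noise input suggested in point 1 is sketched below in C. It writes a mono 16-bit PCM WAV file of approximately Gaussian noise, matching the 8000 Hz, 1 channel, 16-bit format that rade_demod_wav reports for its input in the logs above. The function name `write_noise_wav`, its parameters and the noise level are assumptions made for this example, not part of the repo, and whether noise of a given level exercises the acquisition path in a representative way would need checking.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Roughly unit-variance Gaussian sample: sum of 12 uniforms minus 6. */
static double approx_gauss(void) {
    double s = 0.0;
    for (int i = 0; i < 12; i++)
        s += rand() / ((double)RAND_MAX + 1.0);
    return s - 6.0;
}

/* Write the low `nbytes` bytes of v in little-endian order. */
static void put_le(FILE *f, uint32_t v, int nbytes) {
    for (int i = 0; i < nbytes; i++)
        fputc((int)((v >> (8 * i)) & 0xFF), f);
}

/* Write n samples of noise with the given RMS level (in 16-bit sample
   units) to a mono 16-bit PCM WAV file. Returns 0 on success, -1 on error. */
int write_noise_wav(const char *path, uint32_t rate, uint32_t n,
                    double rms, unsigned seed) {
    assert(rms >= 0.0);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    uint32_t data_bytes = n * 2u;
    srand(seed);                       /* fixed seed gives a repeatable file */
    fputs("RIFF", f); put_le(f, 36u + data_bytes, 4);
    fputs("WAVEfmt ", f); put_le(f, 16, 4);
    put_le(f, 1, 2);                   /* format tag: PCM */
    put_le(f, 1, 2);                   /* channels: mono */
    put_le(f, rate, 4);                /* sample rate */
    put_le(f, rate * 2u, 4);           /* bytes per second */
    put_le(f, 2, 2);                   /* block align */
    put_le(f, 16, 2);                  /* bits per sample */
    fputs("data", f); put_le(f, data_bytes, 4);
    for (uint32_t i = 0; i < n; i++) {
        double v = rms * approx_gauss();
        if (v > 32767.0) v = 32767.0;  /* clip to the int16 range */
        if (v < -32768.0) v = -32768.0;
        put_le(f, (uint32_t)(uint16_t)(int16_t)v, 2);
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

For example, `write_noise_wav("noise.wav", 8000, 95232, 1000.0, 1)` produces a file of the same length as the tx.wav used above, which could then be passed to `./src/rade_demod_wav noise.wav rx.wav` under perf.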

@drowe67
Collaborator

drowe67 commented Mar 17, 2026

@tmiw - as the work in this PR has not been signed off pls ensure the main branch of this repo is used for any distribution of the nopy library, e.g. freedv-gui, Flex etc. As per our PLT decision a few days ago (and at several other times) we don't want unreviewed RADE V1 code being distributed.

@peterbmarks
Owner

I have a user who is (barely) able to run on a Pi 3 with 1GB of RAM. Leighton wrote just now:

Hi Peter,

I just finished compiling the latest code on an old stock standard Raspberry Pi 3 with 1 GB of memory. I started the build with the latest clean headless image (trixie).

I wasn't able to get it running with ALSA as I had wanted (thinking that this might be lighter?) so I had to install pulseaudio and compile with the pulse audio libraries. The best way I can describe it is ALSA resulted in broken audio packets being transmitted - I didn't test on receive before moving to pulse.

Running in receive, without any sync, top shows the program CPU usage at 105%. In transmit mode, the program CPU usage drops to around 60%. Even with the CPU over 100% the Pi 3 was still responding promptly between receive and transmit.

I just did a quick test on-air with Joe. With RADE sync, the CPU was still up around 105%. There was some audio underrun with the occasional "pop" throughout reception and the decoder spent a few seconds catching up on buffered receive audio. Even with these limitations, I was able to clearly understand Joe (SNR was reporting 22 dB my end).

Anyway, I just thought that it would be worth letting you know as this seems promising for running on anything with a little more processing power (or even more functional on a Pi 3 with some optimisation?).

When I get a chance, I will see if there is anything that I can do to squeeze some more out of the Pi.

Regards,
Leighton

@drowe67
Collaborator

drowe67 commented Mar 17, 2026

That's an interesting data point Peter. Jean-Marc has told me the FARGAN Vocoder (which dominates theoretical CPU by a factor of 10) should run on a Pi 3 as minimum. If Leighton wishes to see more optimisation work done by the team pls encourage him to submit a feature request form.

@peterbmarks
Owner

Couple of things..

  1. A Pi 4 has 4 Cores so I think utilisation goes up to 400%. He's reporting 105% which doesn't seem too bad.
    Having said that, he's noting audio under-run and asks a good question about whether different audio drivers are more efficient on Linux (Raspbian).
  2. Mooneer's changes are also in rade_dsp which might explain the impressive decode and encode increase in performance. @tmiw Was there another change, perhaps compiler optimisation that would explain this?
  3. @drowe67 what's the procedure for code review? I don't have the skills. Are you the one?

Peter
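
The 400% reasoning in point 1 can be made concrete with a small C sketch. It assumes top's default mode, where a process figure of 100% means one fully busy core, and a quad-core board (the Pi 3 in Leighton's report is also quad-core). The function name `machine_share` is made up for this example.

```c
#include <assert.h>

/* Fraction of total machine capacity in use, given a per-process CPU
   figure from top (100 = one fully busy core) and the number of cores. */
double machine_share(double top_percent, int n_cores) {
    assert(n_cores > 0);
    return top_percent / (100.0 * n_cores);
}
```

Here `machine_share(105.0, 4)` is about 0.26, so roughly a quarter of the board's total capacity. That figure says nothing about how the load is spread: if most of the work sits in a single thread, that thread can still be saturated, which would be consistent with the underruns reported.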

@drowe67
Collaborator

drowe67 commented Mar 17, 2026

> @drowe67 what's the procedure for code review? I don't have the skills. Are you the one?

Yes, as agreed at our last PLT I will be signing off on this code before release.

@drowe67 drowe67 closed this Mar 17, 2026
@drowe67 drowe67 reopened this Mar 17, 2026
@drowe67
Collaborator

drowe67 commented Mar 17, 2026

Oops, sorry, pressed the wrong button 😃

@tmiw
Collaborator Author

tmiw commented Mar 17, 2026

> That's an interesting data point Peter. Jean-Marc has told me the FARGAN Vocoder (which dominates theoretical CPU by a factor of 10) should run on a Pi 3 as minimum. If Leighton wishes to see more optimisation work done by the team pls encourage him to submit a feature request form.

FR created: drowe67/freedv-gui#1254

> Couple of things..
>
> 1. A Pi 4 has 4 Cores so I think utilisation goes up to 400%. He's reporting 105% which doesn't seem too bad.
>    Having said that, he's noting audio under-run and asks a good question about whether different audio drivers are more efficient on Linux (Raspbian).

The recommended audio engine these days is PipeWire, with PulseAudio being the next best if PipeWire isn't possible for whatever reason.

> 2. Mooneer's changes are also in rade_dsp which might explain the impressive decode and encode increase in performance. @tmiw Was there another change, perhaps compiler optimisation that would explain this?

The changes do remove some recursion that was added during initial debugging, which might make it easier for the compiler to optimize.
