Skip to content

cpu/fe310: don't call thread_yield when sched_active_thread is invalid#12109

Merged
kaspar030 merged 1 commit intoRIOT-OS:masterfrom
nmeum:pr/riscv_context_switch
Sep 16, 2019
Merged

cpu/fe310: don't call thread_yield when sched_active_thread is invalid#12109
kaspar030 merged 1 commit intoRIOT-OS:masterfrom
nmeum:pr/riscv_context_switch

Conversation

@nmeum
Copy link
Member

@nmeum nmeum commented Aug 28, 2019

Contribution description

As the comment above cpu_switch_context_exit notes:

sched_active_thread is not valid when cpu_switch_context_exit() is called.

Unfortunately, thread_yield(), which is called directly by
cpu_switch_context_exit(), uses sched_active_thread possibly resulting
in a null pointer dereference.

Solution: Trigger a software interrupt to perform a context switch and
let sched_run() determine the next valid thread from there.

Testing procedure

I personally noticed this while trying to use RIOT with a riscv simulator. However, since this might as well be a bug in the simulator is probably best to "test this" by reading the code.

Disclaimer: I am not a riscv expert, but I think it's pretty obvious that calling thread_yield() with an invalid sched_active_thread is a bad idea, unless you want to rely on the cpu not raising a trap/exception.

As the comment above cpu_switch_context_exit notes:

	sched_active_thread is not valid when cpu_switch_context_exit() is called.

Unfortunately, thread_yield(), which is called directly by
cpu_switch_context_exit(), uses sched_active_thread possibly resulting
in a null pointer dereference.

Solution: Trigger a software interrupt to perform a context switch and
let sched_run() determine the next valid thread from there.
@nmeum
Copy link
Member Author

nmeum commented Sep 4, 2019

CC: @kenrabold (wrote the initial hifive1 support)

@kenrabold
Copy link
Contributor

I see what you are changing here and agree that thread_yield() should not be called cpu_switch_context_exit(). However, there is something else going on with the context changing code now because thread tests that were passing previously (eg. thread_msg_block_race, thread_race, etc...) are now not working. This PR is fine by me to address the immediate issue raised, but there is still a problem that needs fixing. I'm just not sure what that is at the moment.

@nmeum
Copy link
Member Author

nmeum commented Sep 5, 2019

Interesting, thread_race passes on the simulator I initially tested this with, but I just tested it with a real hifive1 board and the context swap race condition test does seem to fail.

@nmeum
Copy link
Member Author

nmeum commented Sep 5, 2019

However, if I am not mistaken the test also fail on current git HEAD 999fffd without the changes proposed here.

@kenrabold
Copy link
Contributor

I dug into this issue deeper and found that the reason for the thread_race and thread_msg_block_race failures was that when thread_yield_higher invokes a SW interrupt, the FE310 can take up to 4-7 cycles to respond. As a result, assembly instructions after the call to thread_yield_higher in the current active thread will be executed resulting in possible old data within registers. The 2 test programs I referenced produce that effect and is similar to #8897

Your simulator would not have had this issue as I'm sure you are not introducing a latency in the simulation of interrupts.

To fix this, I added 8 nops after invoking the SW interrupt. I also addressed the thread_yield call and the issue referenced in #12110

My changes are here: https://github.com/kenrabold/RIOT/blob/pr_thread_yield_higher/cpu/fe310/cpu.c

You can update your PR with these changes, or I can request the PR to pull in these fixes.

Thanks for finding this and calling it out. Good bug

@miri64 miri64 added Area: cpu Area: CPU/MCU ports Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Platform: RISC-V Platform: This PR/issue effects RISC-V-based platforms labels Sep 10, 2019
@nmeum
Copy link
Member Author

nmeum commented Sep 10, 2019

Thanks for investigating this further and coming up with a fix.

You can update your PR with these changes, or I can request the PR to pull in these fixes.

Either way is fine with me. Maybe just submit a separate PR which depends on this one?

@nmeum
Copy link
Member Author

nmeum commented Sep 16, 2019

@miri64 since this was ack'ed by the person who originally added the hifive board support is there anything that prevents this from getting merged?

@miri64
Copy link
Member

miri64 commented Sep 16, 2019

Disclaimer: I am not a riscv expert, but I think it's pretty obvious that calling thread_yield() with an invalid sched_active_thread is a bad idea, unless you want to rely on the cpu not raising a trap/exception.

I would call myself even less of an expert, especially when it comes to cpu/board internals, so not sure why I was tagged. @kaspar030 @aabadie can one of you maybe test this? The fix looks fine to my limited understanding.

@kaspar030
Copy link
Contributor

I'll run some tests.

@kaspar030 kaspar030 added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Sep 16, 2019
@cladmi
Copy link
Contributor

cladmi commented Sep 16, 2019

If you want some reference test for this, there the test from #11759 and 'thread_float' that does call 'thread_yield' from interrupt context.

ARMv7 was changed to go to that behavior too #11887

@kaspar030
Copy link
Contributor

If you want some reference test for this, there the test from #11759 and 'thread_float' that does call 'thread_yield' from interrupt context.

The usual threading tests (that did not fail on master, too) run fine with this PR. So does "tests/isr_thread_yield_higher" from #11759.

ACK.

@kaspar030 kaspar030 merged commit 071c2b2 into RIOT-OS:master Sep 16, 2019
@kb2ma kb2ma added this to the Release 2019.10 milestone Sep 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: cpu Area: CPU/MCU ports CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Platform: RISC-V Platform: This PR/issue effects RISC-V-based platforms Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants