xtimer: fix mutex unlocking in _mutex_timeout #6428

Merged: OlegHahm merged 1 commit into RIOT-OS:master from OTAkeys:fix/xtimer_mutex_infinite_loop, Jan 20, 2017
Conversation

@vincent-d (Member)

Changes proposed in #6427.

@miri64 miri64 assigned miri64 and kaspar030 and unassigned miri64 Jan 19, 2017
@miri64 miri64 added the "Type: bug" and "Area: timers" labels Jan 19, 2017
@miri64 miri64 added this to the Release 2017.01 milestone Jan 19, 2017
    mt->timeout = 1;
    sched_set_status(mt->thread, STATUS_PENDING);
    list_remove(&mt->mutex->queue, (list_node_t *)&mt->thread->rq_entry);
    if (mt->mutex->queue.next == NULL) {
@lebrush (Member) commented Jan 19, 2017:
Really good catch! 👍

I had overlooked that MUTEX_LOCKED is not always at the tip of the list! It only is when the mutex is owned by exactly one thread. If there are more threads waiting, the list is instead terminated by NULL. I guess this is done to be able to use thread_add_to_list cleanly.

There's still one race condition: if the timer fires between setting the timer and locking the mutex, it might create inconsistencies in the mutex queue. Just test that list_remove really removed a node:

    list_node_t *node = list_remove(...);
    if (node != NULL && mt->mutex->queue.next == NULL) {
        /* we were the last node in the queue, but the mutex is still owned by someone else */
        mt->mutex->queue.next = MUTEX_LOCKED;
    }

@OlegHahm OlegHahm added the "CI: needs squashing" label Jan 19, 2017
@miri64 (Member) left a comment:

Proxy-ACK, please squash

    mt->timeout = 1;
    list_node_t *node = list_remove(&mt->mutex->queue,
                                    (list_node_t *)&mt->thread->rq_entry);
    if (node != NULL && mt->mutex->queue.next == NULL) {
(Member)

Please put each condition into parentheses.

@vincent-d (Member, Author)
addressed

@vincent-d vincent-d force-pushed the fix/xtimer_mutex_infinite_loop branch from b9fd4b7 to eacffdf Compare January 19, 2017 17:27
@vincent-d (Member, Author)

@OlegHahm comment addressed and squashed

@lebrush (Member)

lebrush commented Jan 19, 2017 via email

@miri64 miri64 removed the "CI: needs squashing" label Jan 19, 2017
@OlegHahm (Member) left a comment:

@lebrush requested to wait

@OlegHahm (Member)

@lebrush, waiting for what?

@lebrush (Member)

lebrush commented Jan 20, 2017

Sorry for the delay. I wanted to make sure that all bugs here are addressed, since I introduced the feature and the bugs :-)

After reviewing the code once more I made a table of the following corner cases: what happens if the timer fires before mutex_lock is called, and what happens if it fires after. Since the code within xtimer_mutex_lock_timeout is not atomic, we have some possible race conditions.

| Mutex was   | Before mutex_lock                           | After mutex_lock                               |
|-------------|---------------------------------------------|------------------------------------------------|
| UNLOCKED    | mutex: LOCKED (1), return -1                | -                                              |
| LOCKED (1)  | mutex_lock blocks until unlocked; return -1 | mutex: LOCKED (1), return -1 (with this patch) |
| LOCKED (2+) | mutex_lock blocks until unlocked; return -1 | mutex: LOCKED (2+), return -1                  |

This patch solves the case LOCKED (1) | After mutex_lock.

| Mutex was   | Before mutex_lock | After mutex_lock |
|-------------|-------------------|------------------|
| UNLOCKED    | WRONG             | -                |
| LOCKED (1)  | WRONG             | OK               |
| LOCKED (2+) | WRONG             | OK               |

The implementation of xtimer might decide to spin if the timeout is too small. Hence we need to solve the Before column as well. One solution would be to replace the if (timeout != 0) with:

    if (timeout > XTIMER_BACKOFF)

With these two fixes we would be at:

| Mutex was   | Before mutex_lock | After mutex_lock |
|-------------|-------------------|------------------|
| UNLOCKED    | OK                | -                |
| LOCKED (1)  | OK                | OK               |
| LOCKED (2+) | OK                | OK               |

The last consideration I could think of is the case where the timer triggers between the return of mutex_lock (ISR-safe) and xtimer_remove (ISR-safe). The mutex would have the correct state, but the function would return -1. How probable is it that this happens?

I think it is necessary to include a fix for the XTIMER_BACKOFF as well. It's not as likely to happen as this one but still. Maybe a separate PR in order not to block this one any longer?

@lebrush (Member)

lebrush commented Jan 20, 2017

TL;DR: I would merge this one and address the other cases in, e.g., #6441.

@OlegHahm (Member)

What was the initial reason not to call mutex_unlock() in the callback?

@lebrush (Member)

lebrush commented Jan 20, 2017

Thread A has the mutex, Thread B waits for it. Thread C sets a mutex lock with timeout. If the callback occurs before A has released the mutex and you call mutex_unlock both threads A and B "have" the mutex now, right?

@OlegHahm (Member)

Ah, I think I misunderstood xtimer_mutex_lock_timeout() (and I think the documentation is not very accurate). I thought the function would lock the mutex for a certain time (which doesn't have many use cases, I guess).

@lebrush (Member)

lebrush commented Jan 20, 2017

Good point. Will improve the docs in #6441

    list_node_t *node = list_remove(&mt->mutex->queue,
                                    (list_node_t *)&mt->thread->rq_entry);
    if ((node != NULL) && (mt->mutex->queue.next == NULL)) {
        mt->mutex->queue.next = MUTEX_LOCKED;
(Member)

This has to be MUTEX_LOCKED because we assume that someone still owns the mutex, right?

(Member)

Exactly.

- When the mutex is unlocked, queue.next is NULL: mutex -> NULL
- When the mutex is locked by 1 thread, queue.next is MUTEX_LOCKED: mutex -> MUTEX_LOCKED
- When the mutex is locked by 2+ threads, queue.next is the mt->thread->rq_entry of a waiting thread and the last .next is NULL: mutex -> thread -> (thread -> ...) -> NULL

If node != NULL, we were removed from the list, hence 2+ threads were in the list. If at the same time queue.next is NULL, that means we were the last thread in the list, and we need to set queue.next to MUTEX_LOCKED, as there is now 1 thread still owning the mutex.

@vincent-d (Member, Author)

Yes, otherwise you could have locked it and then the timeout would never fire.

@OlegHahm (Member) left a comment:

ACK

@OlegHahm OlegHahm merged commit 9a38c10 into RIOT-OS:master Jan 20, 2017
@vincent-d vincent-d deleted the fix/xtimer_mutex_infinite_loop branch March 2, 2017 15:00