RFC-14: MCS: Adding budget limit thresholds to endpoints for SC Donation #24

lsf37 · 2024-06-17T01:42:53Z

See https://sel4.atlassian.net/browse/RFC-14 Signed-off-by: Gerwin Klein <gerwin.klein@proofcraft.systems>

kent-mcleod · 2024-07-22T10:42:11Z

If the server thread was told how much budget was available on the donated SC when it is unblocked from seL4_Recv, then it could immediately return an error to the client based on some policy it can implement itself. This in combination with the proposed mechanism that allows a well-behaving client to yield-until-budget before making the call gives more power to both client and server and doesn't require adding more complexity to endpoints.

A single threshold value on an endpoint would have to be suitable for all calls over that endpoint whereas a server implementing a policy at user level could enforce a different threshold limit that's specific to each unique call.

I couldn't find a user-level approach discussed in the proposal or the thesis or the comments.

lsf37 · 2024-07-22T21:42:49Z

I would also welcome some more thinking on alternatives to this approach. Doesn't mean we have to take them, but the design space is large and the RFC at least does not show exploration of much of it -- it's the prototype of one implementation that happens to work, but for me it has two main problems:

To my understanding we are now introducing a loop into the IPC fastpath(!) that is a priory only bounded by the absolute maximum number of refills a user can create a scheduling context with. As the performance number tables show, that means there is no "IPC takes x cycles" for specific scenarios any more -- it all depends on how things were scheduled before and how many refills happen to be in the scheduling context. The user does not really have much control over that apart from setting the maximum number.
The other aspect is verification cost: this is the IPC path -- even small modifications are very expensive there, and even the simplest version of the RFC is not a small modification. This would need significant verification funding first to proceed (it does not look to me like this can be made conditional easily -- if it can, then it could be made available in non-verification configs first).

Would it make sense the keep a "remaining overall budget" field in the SC that is updated each time an SC is charged and re-charged? (I.e. explicitly keep the sum of all refill budgets). This would avoid the loop and make Kent's alternative of a server (and even well-behaved client) checking the remaining budget quite fast and easy to implement without introducing new kernel mechanisms on endpoints.

Edit: sorry, that last sentence doesn't make sense. The sum of all refill budgets would always be the overall budget of the SC. It's just the potentially mergable head refills that we want to know, which is what the loop currently computes.

lsf37 · 2024-08-04T00:31:01Z

Status update from 2024-07-25 TSC meeting: awaiting change of prototype implementation by Indan.

Indanz · 2024-08-05T11:43:52Z

If the server thread was told how much budget was available on the donated SC when it is unblocked from seL4_Recv, then it could immediately return an error to the client based on some policy it can implement itself. This in combination with the proposed mechanism that allows a well-behaving client to yield-until-budget before making the call gives more power to both client and server and doesn't require adding more complexity to endpoints.

The problem is that we don't know how much budget is available when the server gets unblocked. To know that, we need to take a timestamp on the fast path, which currently doesn't happen.

So from a performance point of view your proposal would require all IPC calls to take a timestamp and an extra yield-until-budget syscall for tasks wanting this feature.

With the proposal the cost is one extra if-check to check for budget threshold for all IPC calls, and taking the timestamp only when a budget is set.

Indanz · 2024-08-05T12:11:51Z

To my understanding we are now introducing a loop into the IPC fastpath(!)

I'm getting rid of those new loops and will go to the slow path if the head budget isn't sufficient.

The other aspect is verification cost

The loop to gather enough budget is the same loop as the one at the end of refill_budget_check, except that the budget threshold is used instead of MIN_BUDGET. refill_unblock_check already merges overlapping refills (I don't understand how they can happen though, seems easier to avoid than to fix afterwards to me).

But yes, if you want to proof that budget threshold feature does what it should, the verification cost will be high no matter how it's implemented.

it does not look to me like this can be made conditional easily -- if it can, then it could be made available in non-verification configs first

I think it could be made conditional easily, but the question is whether you want that.

kent-mcleod · 2024-08-05T12:20:42Z

The problem is that we don't know how much budget is available when the server gets unblocked. To know that, we need to take a timestamp on the fast path, which currently doesn't happen.

There is currently a timestamp taken on every slowpath entry to the kernel because every time the kernel is entered it calculates whether the time remaining in the head refill is greater than MIN_BUDGET: refill_head(sc)->rAmount - usage in refill_sufficient(). The result of this calculation is also the same as the amount of time until the currently programmed timer interrupt will fire and is the amount of budget available to the server before a timeout fault would be generated.

If this check isn't also in the fastpath then it would need to be added regardless.

Indanz · 2024-08-05T13:24:16Z

If this check isn't also in the fastpath then it would need to be added regardless.

Why? We already paid most of the cost of a syscall, handling the fast path is always faster than doing anything else. MIN_BUDGET applies to generic system calls as it is bound to WCET, not fastpath ones, which are on the order of one microsecond.

At worst the called task will be interrupted within MIN_BUDGET by the timer interrupt to end its time-slice. You can usually do 10 or more IPC calls within MIN_BUDGET time.

IPC fastpath is a context switch without SC change, and because of that it can avoid doing anything timing related at the moment.

(For SMP I think we should take the timestamp before taking the kernel lock, see #844, and always taking the timestamp at entry would simplify some things, so I'm not against it in principle. But it will add extra overhead, especially on RISCV where it usually traps.)

kent-mcleod · 2024-08-05T23:21:20Z

Why? We already paid most of the cost of a syscall, handling the fast path is always faster than doing anything else. MIN_BUDGET applies to generic system calls as it is bound to WCET, not fastpath ones, which are on the order of one microsecond.

In the kernel design the fastpaths are each optimizations of a specific slowpath kernel operation. So if refill_sufficient() would return false in the slowpath and result in a thread switch, then the fastpath would have to produce the same thread switch, or else fall back to the slowpath. To facilitate this, the fastpath needs to perform the same check.

IPC fastpath is a context switch without SC change, and because of that it can avoid doing anything timing related at the moment.

It arguably can't avoid performing the budget check. The kernel has to ensure there is enough remaining budget to get back into the kernel whenever it returns to user level. It has to guard against a thread calling late enough in its time slice and amplifying its budget via a non-preeptable kernel operation.

I think there's a design inconsistency here where the fastpath is always expected to behave the same as a particular path through the slowpath, but the slowpath will now abort an operation if it doesn't have enough time to complete while the fastpath is hoped to be fastenough to always have enough time available to complete.

Shouldn't the fastpath also abort an operation if it doesn't have enough time to complete?

We either want to assume that:

the fastpath time rounds to zero (and not needing the check as a consequence),
the fastpath time doesn't round to zero and so a check is needed, and because of the design requirement that the fastpath follows the same behavior as the slowpath then the check is the same as the check in the slowpath. In this case maybe we could define the check to be parameterized by input arguments to the kernel, and in the case where it's the fastpath condition the check checks a smaller time value?

Indanz · 2024-08-24T15:52:37Z

Thinking about time is always tricky, hence my delayed response.

It arguably can't avoid performing the budget check. The kernel has to ensure there is enough remaining budget to get back into the kernel whenever it returns to user level. It has to guard against a thread calling late enough in its time slice and amplifying its budget via a non-preeptable kernel operation.

The kernel does not check at syscall entry if there is a higher priority thread scheduled within WCET time. Hence the worst case expected scheduling delay is equal to WCET. Even if a task calls a fastpath syscall just before it's timeslice ends, the worst case scheduling delay for the next task is much less than WCET.

The budget amplification is negligible. However, avoiding that amplification will not reduce the total amount of kernel operation time, though it might avoid taking a timer interrupt. I'm not convinced that stealing MIN_BUDGET budget from a task is better that letting it overrun by a fraction.

We either want to assume that:

1. the fastpath time rounds to zero (and not needing the check as a consequence),

This is the assumption I am making. Scheduling is at 1 us granularity currently, so anything close to that is a rounding error.

2. the fastpath time doesn't round to zero and so a check is needed, and because of the design requirement that the fastpath follows the same behavior as the slowpath then the check is the same as the check in the slowpath.  In this case maybe we could define the check to be parameterized by input arguments to the kernel, and in the case where it's the fastpath condition the check checks a smaller time value?

I have the feeling that the kernel currently abuses WCET for kernel entry and exit overhead, while that should be separate values from WCET (the only explanation I have for MIN_BUDGET = 2 * WCET). As WCET is used everywhere, it's unclear what purpose each margin has.

In the end it's about what timing guarantees seL4 provides. In my view seL4 should be suitable for hard-realtime, which implies (ignoring IRQs):

Tasks start when they should. This currently has error Es = WCET.
Tasks get enough budget in a timely manner. This currently has error of at least Eb = Es + MIN_BUDGET because of the budget check. But because notifications cause LIFO scheduling behaviour, no guarantees whatsoever can be given at the moment.
Scheduling errors should not accumulate. Es and Eb are non-accumulating. (Notifications causing LIFO scheduling is exactly such a big problem because it does accumulate, but let's ignore it for this analysis.)

Note that no budget overruns happening is not a design goal in itself. It may be necessary to be able to make certain guarantees, but I'm not fully convinced it is yet. It does avoid accumulating scheduling delays in case other tasks of the same priority all do budget amplification. However, that error is at most WCET per task (much less in reality), while the budget check causes tasks to lose up to MIN_BUDGET per period. I'd argue that a scheduling jitter of MIN_BUDGET is much worse than a scheduling jitter bound by WCET and much less in practice.

In other words, currently tasks needing a budget of b need to set budget = b + Eb if they have a time limit and budget = b + MIN_BUDGET otherwise. Without the budget check they can set budget = b + Es or budget = b for all tasks. Overall time can be divided more efficiently without a budget check at each syscall entry and scheduling jitter will be lower. As far as I can remember, the original sporadic scheduling paper did not check budget at every kernel entry either.

Indanz · 2024-08-24T15:56:19Z

To clarify, I think a slight budget overrun is fine if it means the task will get that amount less budget next period, to avoid accumulating errors.

Indanz · 2025-08-13T10:46:34Z

Simplified implementation of budget thresholds can be found here: Indanz/seL4#2

Edit: To be clear, that only implement budget thresholds, NOT budget limits, which I think don't add enough value to implement. See Jira discussion.

lsf37 · 2025-12-02T00:22:45Z

The new draft implementation does look simpler, but it mixes in other behaviour changes and does break verification by changing the preemption point behaviour and some other parts in the draft that are outside ifdef guards.

I'm not necessarily against changing the preemption point behaviour in MCS, but it is unclear to me what consequences it has. It seems to me like this would change behaviour slightly (exit, take IRQ immediately, re-enter kernel). It also sounds to me like this would break the kernel entry/exit budget computations, because a single kernel call could now have the cost of entry -> preemption -> exit -> immediate entry -> IRQ handling/schedule new thread -> exit, whereas before it was entry -> preemption -> IRQ handling/schedule -> exit. To be fair, I'm not sure what the point of the entry/exit WCET is that MCS is doing, because it does not reflect the actual WCET and any thread can go over its budget roughly by the amount of the actual kernel WCET. If we have that, why do we try so hard to account for just entry/exit?

Should we just admit that we can't fully guarantee absence of budget overrun by kernel WCET and just leave that up to user-level analysis? That would simplify a lot of things in the MCS implementation in general. (It would also make a lot of verification work obsolete that has already been put in to do verify all that complexity, but that's not a good reason to not get rid of it).

I'm also not entirely sure about adding invocations to endpoints in general. This is a major deviation of the principle that you can only send messages to endpoints. If they have invocations, that would break the ability to simulate another cap via an endpoint, because however you encode that endpoint invocation is now something that is not available for other caps any more. There probably are some points in that design space (I haven't fully checked Indan's draft for this), but it does feel like a hack. Is there nothing cleaner that we can do?

pingerino

I've done a high level pass due to the upcoming TSC meeting.
I don't have a lot of time to look at this.

On the surface adding thresdholds might be a useful change, though I would really like to see some use cases to motivate the effort involved in merging and verifying the change.

I'm happy to discuss this further. I've used conventional comments to make my comments, and tried my best to not come across as unsupportive, while also being appropriately sceptical. Great to see work being done to further the MCS apis.

pingerino · 2025-12-02T05:41:09Z