
Conversation

@lsf37
Member

@lsf37 lsf37 commented Jun 17, 2024

Original Jira issue and discussion

Rendered version of the RFC text.

See https://sel4.atlassian.net/browse/RFC-14

Signed-off-by: Gerwin Klein <gerwin.klein@proofcraft.systems>
@lsf37 lsf37 force-pushed the 0140-budget-limit-threshold branch from dffaf1c to 0d64b60 on June 17, 2024 01:43
@kent-mcleod
Member

If the server thread was told how much budget was available on the donated SC when it is unblocked from seL4_Recv, then it could immediately return an error to the client based on some policy it can implement itself. This, in combination with the proposed mechanism that allows a well-behaved client to yield-until-budget before making the call, gives more power to both client and server and doesn't require adding more complexity to endpoints.

A single threshold value on an endpoint would have to be suitable for all calls over that endpoint whereas a server implementing a policy at user level could enforce a different threshold limit that's specific to each unique call.

I couldn't find a user-level approach discussed in the proposal or the thesis or the comments.
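
A minimal sketch of what that user-level policy could look like, assuming a hypothetical mechanism by which the kernel reports the remaining budget of the donated SC when the server is unblocked (here assumed, purely for illustration, to arrive in message register 1). The reporting mechanism, handle_request() and the error label are all illustrative, not existing API:

```c
/* Sketch only: assumes a hypothetical kernel extension that reports the
 * remaining budget of the donated SC when the server is unblocked from
 * seL4_Recv, here assumed to arrive in message register 1. The reporting
 * mechanism, handle_request() and the error label are illustrative. */
#include <sel4/sel4.h>

#define OP_READ               1
#define OP_UPDATE             2
#define ERR_NOT_ENOUGH_BUDGET 1

seL4_MessageInfo_t handle_request(seL4_Word op);   /* server business logic */

/* Per-call budget policy (in ticks), chosen by the server at user level. */
static const seL4_Word op_threshold[] = {
    [OP_READ]   = 5000,
    [OP_UPDATE] = 20000,
};

void server_loop(seL4_CPtr ep, seL4_CPtr reply)
{
    seL4_Word badge;
    seL4_MessageInfo_t reply_info;

    seL4_Recv(ep, &badge, reply);           /* wait for the first request */

    for (;;) {
        seL4_Word op     = seL4_GetMR(0);
        seL4_Word budget = seL4_GetMR(1);   /* hypothetical: reported remaining budget */

        if (op < OP_READ || op > OP_UPDATE || budget < op_threshold[op]) {
            /* Fail fast with a protocol-level error instead of risking
             * budget expiry part-way through the request. */
            reply_info = seL4_MessageInfo_new(ERR_NOT_ENOUGH_BUDGET, 0, 0, 0);
        } else {
            reply_info = handle_request(op);
        }
        /* Reply to the current client and wait for the next request. */
        seL4_ReplyRecv(ep, reply_info, &badge, reply);
    }
}
```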

@lsf37
Member Author

lsf37 commented Jul 22, 2024

I would also welcome some more thinking on alternatives to this approach. Doesn't mean we have to take them, but the design space is large and the RFC at least does not show exploration of much of it -- it's the prototype of one implementation that happens to work, but for me it has two main problems:

  • To my understanding we are now introducing a loop into the IPC fastpath(!) that is a priori only bounded by the absolute maximum number of refills a user can create a scheduling context with. As the performance number tables show, that means there is no "IPC takes x cycles" for specific scenarios any more -- it all depends on how things were scheduled before and how many refills happen to be in the scheduling context. The user does not really have much control over that apart from setting the maximum number.

  • The other aspect is verification cost: this is the IPC path -- even small modifications are very expensive there, and even the simplest version of the RFC is not a small modification. This would need significant verification funding first to proceed (it does not look to me like this can be made conditional easily -- if it can, then it could be made available in non-verification configs first).

Would it make sense to keep a "remaining overall budget" field in the SC that is updated each time an SC is charged and re-charged? (I.e. explicitly keep the sum of all refill budgets.) This would avoid the loop and make Kent's alternative of a server (and even a well-behaved client) checking the remaining budget quite fast and easy to implement without introducing new kernel mechanisms on endpoints.

Edit: sorry, that last sentence doesn't make sense. The sum of all refill budgets would always be the overall budget of the SC. It's just the potentially mergeable head refills that we want to know about, which is what the loop currently computes.

@lsf37
Member Author

lsf37 commented Aug 4, 2024

Status update from 2024-07-25 TSC meeting: awaiting change of prototype implementation by Indan.

@Indanz
Contributor

Indanz commented Aug 5, 2024

If the server thread was told how much budget was available on the donated SC when it is unblocked from seL4_Recv, then it could immediately return an error to the client based on some policy it can implement itself. This, in combination with the proposed mechanism that allows a well-behaved client to yield-until-budget before making the call, gives more power to both client and server and doesn't require adding more complexity to endpoints.

The problem is that we don't know how much budget is available when the server gets unblocked. To know that, we need to take a timestamp on the fast path, which currently doesn't happen.

So from a performance point of view your proposal would require all IPC calls to take a timestamp and an extra yield-until-budget syscall for tasks wanting this feature.

With the proposal, the cost is one extra if-check for the budget threshold on all IPC calls, plus taking the timestamp only when a budget threshold is set.
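
To make the shape of that check concrete, here is a standalone sketch (illustrative only, not the actual patch; ep_threshold stands for the proposed new per-endpoint field, and consumed for the time already used in the current activation):

```c
/* Illustrative only: the shape of the proposed threshold check, not the
 * actual patch. 'ep_threshold' stands for the proposed new per-endpoint
 * field; 'head_amount' is the budget in the SC's head refill. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ticks_t;

/* Budget left in the head refill after accounting for time already consumed. */
static ticks_t head_remaining(ticks_t head_amount, ticks_t consumed)
{
    return head_amount > consumed ? head_amount - consumed : 0;
}

/* Returns true if the IPC may proceed. Reading the timer (to compute
 * 'consumed') is only needed when a threshold is configured, which is the
 * extra cost referred to above. */
static bool threshold_check(ticks_t ep_threshold, ticks_t head_amount,
                            ticks_t consumed)
{
    if (ep_threshold == 0) {
        return true;    /* no threshold set: no extra work on the IPC path */
    }
    return head_remaining(head_amount, consumed) >= ep_threshold;
}
```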

@Indanz
Contributor

Indanz commented Aug 5, 2024

  • To my understanding we are now introducing a loop into the IPC fastpath(!)

I'm getting rid of those new loops and will go to the slow path if the head budget isn't sufficient.

The other aspect is verification cost

The loop to gather enough budget is the same loop as the one at the end of refill_budget_check, except that the budget threshold is used instead of MIN_BUDGET. refill_unblock_check already merges overlapping refills (I don't understand how those can happen, though; it seems easier to avoid them than to fix them afterwards).

But yes, if you want to prove that the budget threshold feature does what it should, the verification cost will be high no matter how it's implemented.

it does not look to me like this can be made conditional easily -- if it can, then it could be made available in non-verification configs first

I think it could be made conditional easily, but the question is whether you want that.

@kent-mcleod
Member

The problem is that we don't know how much budget is available when the server gets unblocked. To know that, we need to take a timestamp on the fast path, which currently doesn't happen.

There is currently a timestamp taken on every slowpath entry to the kernel, because every time the kernel is entered it calculates whether the time remaining in the head refill is greater than MIN_BUDGET: refill_head(sc)->rAmount - usage in refill_sufficient(). The result of this calculation is also the amount of time until the currently programmed timer interrupt will fire, and the amount of budget available to the server before a timeout fault would be generated.

If this check isn't also in the fastpath then it would need to be added regardless.
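
For reference, a simplified rendering of the check described above (based on the kernel's refill_sufficient()/refill_capacity(); the exact form may differ between kernel versions, so treat this as a sketch rather than the authoritative source):

```c
/* Simplified sketch of the existing MCS check; relies on kernel-internal
 * types (sched_context_t, ticks_t, refill_head(), MIN_BUDGET) and may not
 * match the current source exactly. */
static inline ticks_t refill_capacity(sched_context_t *sc, ticks_t usage)
{
    /* Budget left in the head refill after subtracting what has already
     * been consumed in this activation. */
    if (usage > refill_head(sc)->rAmount) {
        return 0;
    }
    return refill_head(sc)->rAmount - usage;
}

static inline bool_t refill_sufficient(sched_context_t *sc, ticks_t usage)
{
    /* Enough time left to enter and exit the kernel again (MIN_BUDGET)
     * before the head refill runs out. */
    return refill_capacity(sc, usage) >= MIN_BUDGET;
}
```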

@Indanz
Contributor

Indanz commented Aug 5, 2024

If this check isn't also in the fastpath then it would need to be added regardless.

Why? We already paid most of the cost of a syscall; handling the fast path is always faster than doing anything else. MIN_BUDGET applies to generic system calls, as it is bound to WCET, not to fastpath ones, which are on the order of one microsecond.

At worst the called task will be interrupted within MIN_BUDGET by the timer interrupt to end its time-slice. You can usually do 10 or more IPC calls within MIN_BUDGET time.

IPC fastpath is a context switch without SC change, and because of that it can avoid doing anything timing related at the moment.

(For SMP I think we should take the timestamp before taking the kernel lock, see #844, and always taking the timestamp at entry would simplify some things, so I'm not against it in principle. But it will add extra overhead, especially on RISCV where it usually traps.)

@kent-mcleod
Member

Why? We already paid most of the cost of a syscall; handling the fast path is always faster than doing anything else. MIN_BUDGET applies to generic system calls, as it is bound to WCET, not to fastpath ones, which are on the order of one microsecond.

In the kernel design, each fastpath is an optimization of a specific slowpath kernel operation. So if refill_sufficient() would return false in the slowpath and result in a thread switch, then the fastpath would have to produce the same thread switch, or else fall back to the slowpath. To facilitate this, the fastpath needs to perform the same check.

IPC fastpath is a context switch without SC change, and because of that it can avoid doing anything timing related at the moment.

It arguably can't avoid performing the budget check. The kernel has to ensure there is enough remaining budget to get back into the kernel whenever it returns to user level. It has to guard against a thread calling late enough in its time slice and amplifying its budget via a non-preemptable kernel operation.

I think there's a design inconsistency here where the fastpath is always expected to behave the same as a particular path through the slowpath, but the slowpath will now abort an operation if it doesn't have enough time to complete, while the fastpath is hoped to be fast enough to always have enough time available to complete.

Shouldn't the fastpath also abort an operation if it doesn't have enough time to complete?

We either want to assume that:

  1. the fastpath time rounds to zero (so the check is not needed as a consequence),
  2. the fastpath time doesn't round to zero and so a check is needed, and because of the design requirement that the fastpath follows the same behavior as the slowpath, the check is the same as the check in the slowpath. In this case maybe we could define the check to be parameterized by input arguments to the kernel, and in the case where it's the fastpath condition the check uses a smaller time value?
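
For illustration, option 2 could look roughly like the following (a sketch only: refill_sufficient_for and FASTPATH_WCET are invented names, not existing kernel code; refill_capacity is as in the kernel, sketched a few comments up):

```c
/* Hypothetical, per-entry-path sufficiency check for option 2 above.
 * 'required' would be MIN_BUDGET on the slowpath and a smaller, separately
 * measured bound on the fastpath. FASTPATH_WCET is an assumed constant,
 * not an existing kernel define. */
static inline bool_t refill_sufficient_for(sched_context_t *sc, ticks_t usage,
                                           ticks_t required)
{
    return refill_capacity(sc, usage) >= required;
}

/* slowpath: refill_sufficient_for(sc, usage, MIN_BUDGET)
 * fastpath: refill_sufficient_for(sc, usage, 2 * FASTPATH_WCET) */
```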

@Indanz
Contributor

Indanz commented Aug 24, 2024

Thinking about time is always tricky, hence my delayed response.

It arguably can't avoid performing the budget check. The kernel has to ensure there is enough remaining budget to get back into the kernel whenever it returns to user level. It has to guard against a thread calling late enough in its time slice and amplifying its budget via a non-preemptable kernel operation.

The kernel does not check at syscall entry if there is a higher-priority thread scheduled within WCET time. Hence the worst-case expected scheduling delay is equal to WCET. Even if a task calls a fastpath syscall just before its timeslice ends, the worst-case scheduling delay for the next task is much less than WCET.

The budget amplification is negligible. However, avoiding that amplification will not reduce the total amount of kernel operation time, though it might avoid taking a timer interrupt. I'm not convinced that stealing MIN_BUDGET budget from a task is better than letting it overrun by a fraction.

We either want to assume that:

1. the fastpath time rounds to zero (so the check is not needed as a consequence),

This is the assumption I am making. Scheduling is at 1 us granularity currently, so anything close to that is a rounding error.

2. the fastpath time doesn't round to zero and so a check is needed, and because of the design requirement that the fastpath follows the same behavior as the slowpath, the check is the same as the check in the slowpath. In this case maybe we could define the check to be parameterized by input arguments to the kernel, and in the case where it's the fastpath condition the check uses a smaller time value?

I have the feeling that the kernel currently abuses WCET for kernel entry and exit overhead, while those should be separate values from WCET (the only explanation I have for MIN_BUDGET = 2 * WCET). As WCET is used everywhere, it's unclear what purpose each margin has.

In the end it's about what timing guarantees seL4 provides. In my view seL4 should be suitable for hard-realtime, which implies (ignoring IRQs):

  1. Tasks start when they should. This currently has error Es = WCET.
  2. Tasks get enough budget in a timely manner. This currently has error of at least Eb = Es + MIN_BUDGET because of the budget check. But because notifications cause LIFO scheduling behaviour, no guarantees whatsoever can be given at the moment.
  3. Scheduling errors should not accumulate. Es and Eb are non-accumulating. (Notifications causing LIFO scheduling is exactly such a big problem because it does accumulate, but let's ignore it for this analysis.)

Note that avoiding budget overruns is not a design goal in itself. It may be necessary to be able to make certain guarantees, but I'm not fully convinced of that yet. It does avoid accumulating scheduling delays in case other tasks of the same priority all do budget amplification. However, that error is at most WCET per task (much less in reality), while the budget check causes tasks to lose up to MIN_BUDGET per period. I'd argue that a scheduling jitter of MIN_BUDGET is much worse than a scheduling jitter bounded by WCET, and much less in practice.

In other words, currently tasks needing a budget of b need to set budget = b + Eb if they have a time limit and budget = b + MIN_BUDGET otherwise. Without the budget check they can set budget = b + Es or budget = b for all tasks. Overall time can be divided more efficiently without a budget check at each syscall entry and scheduling jitter will be lower. As far as I can remember, the original sporadic scheduling paper did not check budget at every kernel entry either.
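
To put illustrative numbers on this (made up, only to show the shape of the argument): with WCET = 10 us, and hence MIN_BUDGET = 20 us, a task that needs b = 100 us of actual work per period currently has to reserve b + Eb = 100 + 10 + 20 = 130 us if it has a time limit, or b + MIN_BUDGET = 120 us otherwise; without the entry-time budget check, 110 us respectively 100 us would suffice.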

@Indanz
Contributor

Indanz commented Aug 24, 2024

To clarify, I think a slight budget overrun is fine if it means the task will get that amount less budget next period, to avoid accumulating errors.

@Indanz
Contributor

Indanz commented Aug 13, 2025

Simplified implementation of budget thresholds can be found here: Indanz/seL4#2

Edit: To be clear, that only implements budget thresholds, NOT budget limits, which I think don't add enough value to implement. See the Jira discussion.

@lsf37
Member Author

lsf37 commented Dec 2, 2025

The new draft implementation does look simpler, but it mixes in other behaviour changes and does break verification: it changes the preemption point behaviour, and some other parts of the draft are outside ifdef guards.

I'm not necessarily against changing the preemption point behaviour in MCS, but it is unclear to me what consequences it has. It seems to me like this would change behaviour slightly (exit, take IRQ immediately, re-enter kernel). It also sounds to me like this would break the kernel entry/exit budget computations, because a single kernel call could now have the cost of entry -> preemption -> exit -> immediate entry -> IRQ handling/schedule new thread -> exit, whereas before it was entry -> preemption -> IRQ handling/schedule -> exit. To be fair, I'm not sure what the point of the entry/exit WCET is that MCS is doing, because it does not reflect the actual WCET and any thread can go over its budget roughly by the amount of the actual kernel WCET. If we have that, why do we try so hard to account for just entry/exit?

Should we just admit that we can't fully guarantee absence of budget overrun by kernel WCET and just leave that up to user-level analysis? That would simplify a lot of things in the MCS implementation in general. (It would also make a lot of verification work obsolete that has already been put in to verify all that complexity, but that's not a good reason to not get rid of it).

I'm also not entirely sure about adding invocations to endpoints in general. This is a major deviation from the principle that you can only send messages to endpoints. If they have invocations, that would break the ability to simulate another cap via an endpoint, because whatever encoding you choose for the endpoint invocation is then no longer available for other caps. There probably are some points in that design space (I haven't fully checked Indan's draft for this), but it does feel like a hack. Is there nothing cleaner that we can do?


@pingerino pingerino left a comment


I've done a high-level pass due to the upcoming TSC meeting.
I don't have a lot of time to look at this.

On the surface, adding thresholds might be a useful change, though I would really like to see some use cases to motivate the effort involved in merging and verifying the change.

I'm happy to discuss this further. I've used conventional comments and tried my best to not come across as unsupportive, while also being appropriately sceptical. Great to see work being done to further the MCS APIs.

many systems. This is likely to simplify system design.

Currently, there are almost no restrictions on the amount of budget required to
enter a passive server. This almost inevitably leads to budget expiry, resulting


nit: "this almost inevitably leads to budget expiry"
suggestion: "this can lead to undesirably budget expiry in certain system designs"

I wouldn't say it is inevitable, that's up to the system. If you have a system where overruns are tolerated, or where you think you have the WCET mostly sorted but not completely, you can have occasional budget expiry and handle it. Whether it's desirable or not is up to the system design and is a matter of policy. The wording is a bit strong.


### Thresholds

This change regarding thresholds is motivated by a desire to prevent budget


suggestion: this would be better worded as "to provide an option for users to configure passive servers without requiring timeout handlers"


This change regarding thresholds is motivated by a desire to prevent budget
expiry from occurring inside passive servers, improving worst-case performance
and response time analysis. Additionally, it would shift timeout exceptions into


nothing about this improves worst-case perf or response time analysis - it allows you to design a different kind of system, but ultimately your response time analysis relies on everything in the entire system. It allows you to do a different kind of analysis, where clients can't call into servers if they don't have enough budget. Perhaps there's an argument to be made that this kind of analysis has better worst-case performance, but proving that is in the territory of real-time theorists.

(one could for example argue that to have these budget limits, you'd need a server WCET so high that your client budgets would be big enough to result in a worse outcome analysis-wise - due to the nature of current WCET tooling - it could be faster to handle a timeout than to pad to the WCET)

This change can be motivated simply by stating you want to provide an option to build a different kind of system, where you do not need to wait for the timeout to deny the operation.


This RFC proposes a new mechanism, endpoint thresholds, which aim to prevent
budget expiry from occurring in passive servers, rather than reacting to it
after it occurs.


I understand this RFC is trying to build a compelling case, however I think this can be done without denigrating the existing API. If the proposal wanted to remove the existing API, then it would need to argue why those mechanisms are not useful.

The motivation here would be improved greatly by adding that this is additional, demonstrating why it's needed, without saying all the other options are bad. Reacting is a legitimate way to build a system, especially if your WCET is very loose. They have legitimate use cases depending on client policy.

being a true error case, rather than an unavoidable, but routine occurrence on
many systems. This is likely to simplify system design.

Currently, there are almost no restrictions on the amount of budget required to


nit: "there are almost no"
suggestion: "there are no" - there's not a limit at all


### Budget Limits Guide

The optional budget limit change extends the meaning of an endpoint threshold


suggestion: can there be some use cases here to illustrate what this is for?


## Rationale and alternatives

A key benefit is that this change will support better schedulability analysis


If the claim is better schedulability analysis, then I'd really like to see the evidence.

Reduced server complexity is definitely a possibility, as you do not necessarily need a timeout handler. But you probably still need one if your WCET analysis is insufficient (this relies heavily on WCET being correct). If you have a WCET you are really certain of and it's really tight, you don't need thresholds as you won't hit them.

I'm not opposed to thresholds being added but increasing the size of endpoints needs to be weighed against the reality of this feature being used in practice. I'd love to see more illustrative examples or concrete analysis about schedulability.

more predictable behaviour is of particular benefit for schedulability analysis.
In particular, this predictability is highly important for real-time systems
that the MCS kernel is targeted at. Further, this change also moves timeout
faults into a true error case, rather than an unavoidable consequence of passive


aside: Timeout faults were never intended to only be for error handling.



I'm coming to this PR very late, since it's being reviewed by the TSC this week. I'm not vehemently opposed to the change, but I'm not yet convinced by the motivation and would love to see my questions answered. It's great to see work happening on the MCS API and I love to see it develop and evolve.

backward API compatibility. It would however, break existing binary
compatibility.

- Two true system calls, the existing `seL4_Yield()` and the new


If you can avoid breaking binary compatibility that would be good.

I would suggest a simple seL4_YieldUntil(ticks_t available)
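
For example, client usage could look like this (a sketch only: seL4_YieldUntil is the suggested, not-yet-existing call, and the required budget value is something the client has to know out of band, e.g. from the server's documented WCET):

```c
/* Sketch only: seL4_YieldUntil() does not exist in current seL4; it is the
 * call suggested above. The operation id and budget value are illustrative. */
#include <sel4/sel4.h>

void checked_call(seL4_CPtr server_ep, seL4_Uint64 required_budget)
{
    /* Hypothetical: yield until the caller's SC has at least
     * 'required_budget' ticks available in its head refill. */
    seL4_YieldUntil(required_budget);

    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 1);
    seL4_SetMR(0, 1);               /* illustrative operation id */
    seL4_Call(server_ep, info);
}
```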

@pingerino

To be fair, I'm not sure what the point of the entry/exit WCET is that MCS is doing, because it does not reflect the actual WCET and any thread can go over its budget roughly by the amount of the actual kernel WCET. If we have that, why do we try so hard to account for just entry/exit?

(1) The original intent was that we would one day have a real WCET and update it. Not sure what the status there is.
(2) The rest came about via tuning; threads would end up pretty late if you don't do this in my testing. It was a long time ago though and could be reevaluated with respect to (1).

@Indanz
Contributor

Indanz commented Dec 2, 2025

I'm not necessarily against changing the preemption point behaviour in MCS, but it is unclear to me what consequences it has.

It will make preemptible syscalls run longer because of the extra overhead.

It seems to me like this would change behaviour slightly (exit, take IRQ immediately, re-enter kernel).

Usually there is no pending IRQ and it would just restart the current syscall.

It also sounds to me like this would break the kernel entry/exit budget computations, because a single kernel call could now have the cost of entry -> preemption -> exit -> immediate entry -> IRQ handling/schedule new thread -> exit, whereas before it was entry -> preemption -> IRQ handling/schedule -> exit.

No, it just does the usual cost of entry -> preemption -> exit. The budget is updated and checked at the next entry as done normally for all syscalls.

It is the current code that is weird and duplicates normal bookkeeping and IRQ checking in the pre-emption code to avoid running out of budget.

To be fair, I'm not sure what the point of the entry/exit WCET is that MCS is doing, because it does not reflect the actual WCET and any thread can go over its budget roughly by the amount of the actual kernel WCET. If we have that, why do we try so hard to account for just entry/exit?

Should we just admit that we can't fully guarantee absence of budget overrun by kernel WCET and just leave that up to user-level analysis? That would simplify a lot of things in the MCS implementation in general. (It would also make a lot of verification work obsolete that has already been put in to verify all that complexity, but that's not a good reason to not get rid of it).

Agreed with all of that. I think the harm of enforcing MIN_BUDGET is bigger than the gain, see my discussion with Kent above. But we should continue this discussion in a new issue.

I'm also not entirely sure about adding invocations to endpoints in general.

I'm not doing that, I'm just saying it's a complication that needs some solution.

There probably are some points in that design space (I haven't fully checked Indan's draft for this), but it does feel like a hack. Is there nothing cleaner that we can do?

My hacky solution is making it a reply object invocation. The rationale being that for calls you need access to an EP cap and a reply cap anyway, and that this should be under the server's control.

The original RFC proposed a CNode invocation as a work-around, but that's worse I think.

Another way is using a magic, nonsensical combination of flag values in seL4_MessageInfo_t to alter the behaviour of normal endpoint calls, but that seemed very ugly.
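
Purely to illustrate the reply-object variant being discussed (the invocation name and argument below are hypothetical; nothing like this exists in the current API):

```c
/* Hypothetical invocation, not an existing seL4 API: the server, which
 * already holds the reply cap for calls it accepts, sets the threshold
 * that incoming calls must satisfy. */
#include <sel4/sel4.h>

seL4_Error set_threshold(seL4_CPtr reply_cap, seL4_Uint64 threshold_ticks)
{
    return seL4_Reply_SetBudgetThreshold(reply_cap, threshold_ticks);
}
```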

@Indanz
Contributor

Indanz commented Dec 2, 2025

To be fair, I'm not sure what the point of the entry/exit WCET is that MCS is doing, because it does not reflect the actual WCET and any thread can go over its budget roughly by the amount of the actual kernel WCET. If we have that, why do we try so hard to account for just entry/exit?

(1) The original intent was that we would one day have a real WCET and update it. Not sure what the status there is. (2) The rest came about via tuning; threads would end up pretty late if you don't do this in my testing. It was a long time ago though and could be reevaluated with respect to (1).

I have the strong impression that WCET as used in MIN_BUDGET started as kernel entry/exit time, but ended up being WCET instead. Why else would MIN_BUDGET be twice WCET?

@lsf37
Member Author

lsf37 commented Dec 2, 2025

My hacky solution is making it a reply object invocation. The rationale being that for calls you need access to an EP cap and a reply cap anyway, and that this should be under the server's control.

The original RFC proposed a CNode invocation as a work-around, but that's worse I think.

Indeed. Apologies, I misread the code there. I think an invocation on the reply cap is acceptable. Still a bit of a hack, but better than all alternatives so far.

@lsf37
Member Author

lsf37 commented Dec 2, 2025

It also sounds to me like this would break the kernel entry/exit budget computations, because a single kernel call could now have the cost of entry -> preemption -> exit -> immediate entry -> IRQ handling/schedule new thread -> exit, whereas before it was entry -> preemption -> IRQ handling/schedule -> exit.

No, it just does the usual cost of entry -> preemption -> exit. The budget is updated and checked at the next entry as done normally for all syscalls.

It is the current code that is weird and duplicates normal bookkeeping and IRQ checking in the pre-emption code to avoid running out of budget.

I'm starting to be convinced. It'd be great to reduce complexity here. The main problem with pulling that into this RFC implementation is that it implies unscheduled verification work. @michaelmcinerney do you already have CRefine proofs for mcsPreemptionPoint()? How much work would it be to remove it as in here? Small enough to do it on the side?

@michaelmcinerney

I'm starting to be convinced. It'd be great to reduce complexity here. The main problem with pulling that into this RFC implementation is that it implies unscheduled verification work. @michaelmcinerney do you already have CRefine proofs for mcsPreemptionPoint()? How much work would it be to remove it as in here? Small enough to do it on the side?

I do already have the proofs in CRefine for mcsPreemptionPoint(), but just removing it would be fairly straightforward, I think.

@pingerino

I have the strong impression that WCET as used in MIN_BUDGET started as kernel entry/exit time, but ended up being WCET instead. Why else would MIN_BUDGET be twice WCET?

Yes, it was supposed to be enough time for a thread to get into and out of the kernel. Given we only theoretically had one estimate, this rounds up to 2 * WCET.

@pingerino

A note on MIN_BUDGET: some practical experiments are needed here to quantify the impact on latency and precision if you don't set this. E.g., if a schedulable set of tasks is set up, at what time does the highest-priority task end up waking up? Modern hardware might mean this is very precise without this mechanism (back in the day I was supporting things like the now-defunct kzm).

@Indanz
Contributor

Indanz commented Dec 4, 2025

A note on MIN_BUDGET: some practical experiments are needed here to quantify the impact on latency and precision if you don't set this. E.g., if a schedulable set of tasks is set up, at what time does the highest-priority task end up waking up?

I did a simple analysis in an earlier comment above.

My practical experience is with a system with budgets on the order of 100 us and a WCET of 10 us, and there it causes a lot more scheduling jitter (as measured via RTP audio quality under heavy load). Setting WCET to 1 improved system behaviour and reduced jitter. (Sadly I don't have access to that system any more, so I can't repeat the tests easily.)
