RFC-14: MCS: Adding budget limit thresholds to endpoints for SC Donation #24
Conversation
See https://sel4.atlassian.net/browse/RFC-14
Signed-off-by: Gerwin Klein <gerwin.klein@proofcraft.systems>
If the server thread was told how much budget was available on the donated SC when it is unblocked from seL4_Recv, then it could immediately return an error to the client based on some policy it can implement itself. This, in combination with the proposed mechanism that allows a well-behaving client to yield-until-budget before making the call, gives more power to both client and server and doesn't require adding more complexity to endpoints. A single threshold value on an endpoint would have to be suitable for all calls over that endpoint, whereas a server implementing a policy at user level could enforce a different threshold limit that's specific to each unique call. I couldn't find a user-level approach discussed in the proposal or the thesis or the comments.
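Roughly, that user-level policy could look like the sketch below. It assumes a hypothetical way for the server to learn the remaining budget on the donated SC (for example a value delivered by the kernel on seL4_Recv); no such mechanism exists today, and the helper names and error label are invented for illustration.

```c
#include <sel4/sel4.h>
#include <stdint.h>

/* Hypothetical helpers, not part of any existing API. */
extern uint64_t donated_budget_remaining_us(void);           /* budget left on the donated SC */
extern uint64_t required_budget_us(seL4_Word request_label); /* per-request cost estimate */
extern void handle_request(seL4_MessageInfo_t info, seL4_Word badge, seL4_CPtr reply);

#define ERR_INSUFFICIENT_BUDGET 1 /* illustrative protocol label */

void server_loop(seL4_CPtr ep, seL4_CPtr reply)
{
    while (1) {
        seL4_Word badge;
        /* MCS variant of seL4_Recv, which takes an explicit reply cap. */
        seL4_MessageInfo_t info = seL4_Recv(ep, &badge, reply);

        /* Per-call policy: deny early if the donated budget cannot cover this request. */
        if (donated_budget_remaining_us() < required_budget_us(seL4_MessageInfo_get_label(info))) {
            seL4_Send(reply, seL4_MessageInfo_new(ERR_INSUFFICIENT_BUDGET, 0, 0, 0));
            continue;
        }

        handle_request(info, badge, reply);
    }
}
```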
I would also welcome some more thinking on alternatives to this approach. Doesn't mean we have to take them, but the design space is large and the RFC at least does not show exploration of much of it -- it's the prototype of one implementation that happens to work, but for me it has two main problems:
Would it make sense to keep a "remaining overall budget" field in the SC that is updated each time an SC is charged and re-charged? (I.e. explicitly keep the sum of all refill budgets.) This would avoid the loop and make Kent's alternative of a server (and even a well-behaved client) checking the remaining budget quite fast and easy to implement without introducing new kernel mechanisms on endpoints. Edit: sorry, that last sentence doesn't make sense. The sum of all refill budgets would always be the overall budget of the SC. It's just the potentially mergeable head refills that we want to know, which is what the loop currently computes.
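For reference, the loop being discussed computes something like the sketch below (illustrative only; the iteration helpers are invented and this is not the kernel's actual code):

```c
/* Illustrative only: sum the head refill together with any immediately
 * following refills that are already ready, i.e. the budget usable right
 * now. first_refill()/next_refill() are invented iteration helpers;
 * rTime/rAmount mirror the refill fields conceptually. */
static ticks_t mergeable_head_budget(sched_context_t *sc, ticks_t now)
{
    ticks_t sum = 0;
    for (refill_t *r = first_refill(sc); r != NULL; r = next_refill(sc, r)) {
        if (r->rTime > now) {
            break;          /* this refill is not available yet */
        }
        sum += r->rAmount;  /* counts towards the budget usable now */
    }
    return sum;
}
```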
Status update from 2024-07-25 TSC meeting: awaiting change of prototype implementation by Indan.
The problem is that we don't know how much budget is available when the server gets unblocked. To know that, we need to take a timestamp on the fast path, which currently doesn't happen. So from a performance point of view your proposal would require all IPC calls to take a timestamp, plus an extra yield-until-budget syscall for tasks wanting this feature. With the proposal the cost is one extra if-check for the budget threshold on all IPC calls, and taking the timestamp only when a threshold is set.
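A sketch of that cost argument (illustrative only; the accessor names are invented and this is not the actual implementation):

```c
/* Illustrative sketch only, not the actual implementation: the common
 * IPC path pays a single extra comparison, and the timestamp is taken
 * only when the endpoint actually has a threshold configured. */
static bool_t threshold_permits_call(endpoint_t *ep, sched_context_t *sc)
{
    ticks_t threshold = ep_budget_threshold(ep);    /* invented accessor */
    if (threshold == 0) {
        return true;                                /* no threshold: no timestamp needed */
    }
    ticks_t now = read_timestamp();                 /* only paid on thresholded endpoints */
    return available_budget(sc, now) >= threshold;  /* invented budget query */
}
```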
I'm getting rid of those new loops and will go to the slow path if the head budget isn't sufficient.
The loop to gather enough budget is the same loop as the one at the end of the existing refill code. But yes, if you want to prove that the budget threshold feature does what it should, the verification cost will be high no matter how it's implemented.
I think it could be made conditional easily, but the question is whether you want that.
There is currently a timestamp taken on every slowpath entry to the kernel, because every time the kernel is entered it calculates whether the time remaining in the head refill is greater than the required minimum. If this check isn't also in the fastpath then it would need to be added regardless.
Why? We already paid most of the cost of a syscall; handling the fast path is always faster than doing anything else. At worst the called task will be interrupted within a short time. The IPC fastpath is a context switch without an SC change, and because of that it can avoid doing anything timing related at the moment. (For SMP I think we should take the timestamp before taking the kernel lock, see #844, and always taking the timestamp at entry would simplify some things, so I'm not against it in principle. But it will add extra overhead, especially on RISC-V where it usually traps.)
In the kernel design the fastpaths are each optimizations of a specific slowpath kernel operation. So if refill_sufficient() would return false in the slowpath and result in a thread switch, then the fastpath would have to produce the same thread switch, or else fall back to the slowpath. To facilitate this, the fastpath needs to perform the same check.
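For illustration, the constraint described here amounts to something like the following sketch (not the kernel's actual fastpath code; dest_sc is an illustrative variable name):

```c
/* Sketch: the fastpath must reach the same budget decision as the
 * slowpath, or give up and let the slowpath handle the operation. */
if (!refill_ready(dest_sc) || !refill_sufficient(dest_sc, 0)) {
    /* The slowpath would not simply continue here, so the fastpath
     * cannot either: fall back so both paths agree. */
    slowpath(SysCall);
}
```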
It arguably can't avoid performing the budget check. The kernel has to ensure there is enough remaining budget to get back into the kernel whenever it returns to user level. It has to guard against a thread calling late enough in its time slice and amplifying its budget via a non-preemptible kernel operation. I think there's a design inconsistency here: the fastpath is always expected to behave the same as a particular path through the slowpath, but the slowpath will now abort an operation if it doesn't have enough time to complete, while the fastpath is hoped to be fast enough to always have enough time available to complete. Shouldn't the fastpath also abort an operation if it doesn't have enough time to complete? We either want to assume that:
Thinking about time is always tricky, hence my delayed response.
The kernel does not check at syscall entry if there is a higher priority thread scheduled within WCET time. Hence the worst case expected scheduling delay is equal to WCET. Even if a task calls a fastpath syscall just before its timeslice ends, the worst case scheduling delay for the next task is much less than WCET. The budget amplification is negligible. However, avoiding that amplification will not reduce the total amount of kernel operation time, though it might avoid taking a timer interrupt. I'm not convinced that stealing budget from tasks to enforce MIN_BUDGET is worth it.
This is the assumption I am making. Scheduling is at 1 us granularity currently, so anything close to that is a rounding error.
I have the feeling that the kernel currently abuses WCET for kernel entry and exit overhead, while those should be separate values from WCET (that's the only explanation I have for MIN_BUDGET being twice WCET). In the end it's about what timing guarantees seL4 provides. In my view seL4 should be suitable for hard real-time, which implies (ignoring IRQs):
Note that no budget overruns happening is not a design goal in itself. It may be necessary to be able to make certain guarantees, but I'm not fully convinced it is yet. It does avoid accumulating scheduling delays in case other tasks of the same priority all do budget amplification. However, that error is at most WCET per task (much less in reality), while the budget check causes tasks to lose up to MIN_BUDGET of their budget. In other words, currently tasks needing a certain budget effectively have to be given MIN_BUDGET extra on top of it.
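To put rough numbers on that trade-off (illustrative only, assuming MIN_BUDGET = 2 × WCET as discussed further below, and the roughly 10 us WCET / 100 us budget figures mentioned later in this thread):

$$
\text{budget a task can lose to the check, per period} \;\le\; \mathrm{MIN\_BUDGET} = 2 \times 10\,\mu s = 20\,\mu s \quad (20\%\ \text{of a}\ 100\,\mu s\ \text{budget})
$$

$$
\text{overrun the check prevents} \;\le\; \mathrm{WCET} = 10\,\mu s
$$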
To clarify, I think a slight budget overrun is fine if it means the task will get that amount less budget next period, to avoid accumulating errors.
A simplified implementation of budget thresholds can be found here: Indanz/seL4#2. Edit: to be clear, that only implements budget thresholds, NOT budget limits, which I think don't add enough value to implement. See the Jira discussion.
The new draft implementation does look simpler, but it mixes in other behaviour changes and does break verification by changing the preemption point behaviour and some other parts in the draft that are outside ifdef guards.

I'm not necessarily against changing the preemption point behaviour in MCS, but it is unclear to me what consequences it has. It seems to me like this would change behaviour slightly (exit, take IRQ immediately, re-enter kernel). It also sounds to me like this would break the kernel entry/exit budget computations, because a single kernel call could now have the cost of entry -> preemption -> exit -> immediate entry -> IRQ handling/schedule new thread -> exit, whereas before it was entry -> preemption -> IRQ handling/schedule -> exit.

To be fair, I'm not sure what the point of the entry/exit WCET is that MCS is doing, because it does not reflect the actual WCET and any thread can go over its budget roughly by the amount of the actual kernel WCET. If we have that, why do we try so hard to account for just entry/exit? Should we just admit that we can't fully guarantee absence of budget overruns by the kernel WCET and leave that up to user-level analysis? That would simplify a lot of things in the MCS implementation in general. (It would also make a lot of verification work obsolete that has already been put in to verify all that complexity, but that's not a good reason to not get rid of it.)

I'm also not entirely sure about adding invocations to endpoints in general. This is a major deviation from the principle that you can only send messages to endpoints. If they have invocations, that would break the ability to simulate another cap via an endpoint, because however you encode that endpoint invocation is now something that is not available for other caps any more. There probably are some points in that design space (I haven't fully checked Indan's draft for this), but it does feel like a hack. Is there nothing cleaner that we can do?
pingerino left a comment:
I've done a high level pass due to the upcoming TSC meeting.
I don't have a lot of time to look at this.
On the surface adding thresholds might be a useful change, though I would really like to see some use cases to motivate the effort involved in merging and verifying the change.
I'm happy to discuss this further. I've used conventional comments to make my comments, and tried my best not to come across as unsupportive, while also being appropriately sceptical. Great to see work being done to further the MCS APIs.
> many systems. This is likely to simplify system design.
>
> Currently, there are almost no restrictions on the amount of budget required to
> enter a passive server. This almost inevitably leads to budget expiry, resulting
nit: "this almost inevitably leads to budget expiry"
suggestion: "this can lead to undesirable budget expiry in certain system designs"
I wouldn't say it is inevitable, that's up to the system. If you have a system where overruns are tolerated, or where you think you have the WCET mostly sorted but not completely, you can have occasional budget expiry and handle it. Whether it's desirable or not is up to the system design and is a matter of policy. The wording is a bit strong.
> ### Thresholds
>
> This change regarding thresholds is motivated by a desire to prevent budget
suggestion: this would be better worded as "to provide an option for users to configure passive servers without requiring timeout handlers"
> This change regarding thresholds is motivated by a desire to prevent budget
> expiry from occurring inside passive servers, improving worst-case performance
> and response time analysis. Additionally, it would shift timeout exceptions into
Nothing about this improves worst-case perf or response time analysis - it allows you to design a different kind of system, but ultimately your response time analysis relies on everything in the entire system. It allows you to do a different kind of analysis, where clients can't call into servers if they don't have enough budget. Perhaps there's an argument to be made that this kind of analysis has better worst-case performance, but proving that is in the territory of real-time theorists.
(One could for example argue that to have these budget limits, you'd need a server WCET so high that your client budgets would be big enough to result in a worse outcome analysis-wise - due to the nature of current WCET tooling, it could be faster to handle a timeout than to pad to the WCET.)
This change can be motivated simply by stating you want to provide an option to build a different kind of system, where you do not need to wait for the timeout to deny the operation.
> This RFC proposes a new mechanism, endpoint thresholds, which aim to prevent
> budget expiry from occurring in passive servers, rather than reacting to it
> after it occurs.
I understand this RFC is trying to build a compelling case, however I think this can be done without denigrating the existing API. If the proposal wanted to remove the existing API, then it would need to argue why those mechanisms are not useful.
The motivation here would be improved greatly by adding that this is additional, demonstrating why it's needed, without saying all the other options are bad. Reacting is a legitimate way to build a system, especially if your WCET is very loose. They have legitimate use cases depending on client policy.
> being a true error case, rather than an unavoidable, but routine occurrence on
> many systems. This is likely to simplify system design.
>
> Currently, there are almost no restrictions on the amount of budget required to
nit: "there are almost no"
suggestion: "there are no" - there's not a limit at all
> ### Budget Limits Guide
>
> The optional budget limit change extends the meaning of an endpoint threshold
suggestion: can there be some use cases here to illustrate what this is for?
> ## Rationale and alternatives
>
> A key benefit is that this change will support better schedulability analysis
If the claim is better schedulability analysis, then I'd really like to see the evidence.
Reduced server complexity is definitely a possibility, as you do not necessarily need a timeout handler. But you probably still need one if your WCET analysis is insufficient (this relies heavily on WCET being correct). If you have a WCET you are really certain of and it's really tight, you don't need thresholds as you won't hit them.
I'm not opposed to thresholds being added but increasing the size of endpoints needs to be weighed against the reality of this feature being used in practice. I'd love to see more illustrative examples or concrete analysis about schedulability.
> more predictable behaviour is of particular benefit for schedulability analysis.
> In particular, this predictability is highly important for real-time systems
> that the MCS kernel is targeted at. Further, this change also moves timeout
> faults into a true error case, rather than an unavoidable consequence of passive
aside: Timeout faults were never intended to only be for error handling.
I'm coming to this PR very late, since it's being reviewed by the TSC this week. I'm not vehemently opposed to the change but I'm not yet convinced by the motivation and would love to see my questions answered. It's great to see work happening on the MCS API and I love to see it develop and evolve.
> backward API compatibility. It would however, break existing binary
> compatibility.
>
> - Two true system calls, the existing `seL4_Yield()` and the new
If you can avoid breaking binary compatibility, that would be good.
I would suggest a simple `seL4_YieldUntil(ticks_t available)`.
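Sketched out, the suggestion might look like this (purely illustrative; `seL4_YieldUntil` does not exist today, and the signature is just the one floated above):

```c
#include <sel4/sel4.h>

typedef seL4_Uint64 ticks_t;   /* stand-in for the kernel's ticks_t, as in the suggestion */

/* Hypothetical syscall as suggested above: yield until at least `available`
 * ticks of budget are usable on the caller's scheduling context. */
void seL4_YieldUntil(ticks_t available);

/* Hypothetical client usage: ensure enough budget is available before
 * calling a passive server, instead of relying on a kernel-enforced
 * endpoint threshold. */
void call_server(seL4_CPtr server_ep, ticks_t request_cost)
{
    seL4_YieldUntil(request_cost);
    seL4_Call(server_ep, seL4_MessageInfo_new(0, 0, 0, 0));
}
```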
(1) The original intent was that we would one day have a real WCET and update it. Not sure what the status there is.
It will make pre-emptible syscalls run longer because of the extra overhead.
Usually there is no pending IRQ and it would just restart the current syscall.
No, it just does the usual cost of entry -> preemption -> exit. The budget is updated and checked at the next entry as done normally for all syscalls. It is the current code that is weird and duplicates normal bookkeeping and IRQ checking in the pre-emption code to avoid running out of budget.
Agreed with all of that. I think the harm of enforcing MIN_BUDGET is bigger than the gain, see my discussion with Kent above. But we should continue this discussion in a new issue.
I'm not doing that, I'm just saying it's a complication that needs some solution.
My hacky solution is making it a reply object invocation. The rationale being that for calls you need access to an EP cap and a reply cap anyway, and that this should be under the server's control. The original RFC proposed a CNode invocation as a work-around, but that's worse I think. Another way is using a magic, nonsensical combination of flag values in
I have the strong impression that WCET as used in MIN_BUDGET started as kernel entry/exit time, but ended up being WCET instead. Why else would MIN_BUDGET be twice WCET?
Indeed. Apologies, I misread the code there. I think an invocation on the reply cap is acceptable. Still a bit of a hack, but better than all alternatives so far.
I'm starting to be convinced. It'd be great to reduce complexity here. The main problem with pulling that into this RFC implementation is that it implies unscheduled verification work. @michaelmcinerney do you already have CRefine proofs for
I do already have the proofs in CRefine for
Yes, it was supposed to be enough time for a thread to get into and out of the kernel. Given we only theoretically had one estimate, this rounds up to 2 * WCET.
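I.e. the intent described here is roughly the following (a sketch of the idea, not the literal kernel definition; the macro names are illustrative):

```c
/* One kernel WCET to get into the kernel plus one to get back out again:
 * the minimum budget a thread must have left to be allowed to run. */
#define MIN_BUDGET (2 * KERNEL_WCET_TICKS)
```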
A note on MIN_BUDGET: some practical experiments are needed here to quantify the impact on latency and precision if you don't set this. E.g., if a schedulable set of tasks is set up, at what time does the highest-priority task end up waking up? Modern hardware might mean this is very precise without this mechanism (back in the day I was supporting things like the now-defunct kzm).
I did a simple analysis in an earlier comment above. My practical experience is with a system with budgets in the order of 100us and a WCET of 10us, and then it causes a lot more scheduling jitter (as measured with RTP audio quality under heavy load). Setting WCET to 1 improved system behaviour and reduced jitter. (Sadly I don't have access to that system any more, so can't repeat the tests easily.)
Original Jira issue and discussion
Rendered version of the RFC text.