forked from google/ghost-kernel

5.11 ghost current version #2

wenhuizhang wants to merge 165 commits into 5.11-base from 5.11-ghost-current
Change-Id: I6665a8dc6cb3ac0b11df2eafb9459f67e74315d1
Change-Id: I243e9d328aed1fef8629a713a5734677a4fd9556
Change-Id: I0bdedf1467048bef0bf83c2722dfb1026d6f8e3b
Merge conflict resolutions:
- kernel/sched/core.c: added out_return in pick_next_task

Change-Id: I1371554d7cecbf371fdb811b937e7d878e0b6c8b
In a future patch, BPF programs will be able to call this. The check ensures that the caller's cpu's enclave matches the target cpu's enclave, so that BPF programs (and thus their agents) only affect their own enclaves.

Regarding get_cpu(): I noticed the old code was calling smp_processor_id(), but it was in a preemptible path (via sys_ghost). The WARN_ON caught it. It seemed simplest to have the caller get_cpu(), instead of mucking with the various return paths in the __ helper.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I0ecf3902257280d86f737a7a7cbbcf201e392682
This is a BPF helper function that wakes an agent on a cpu. It is callable by any program of BPF_PROG_TYPE_GHOST_SCHED. Right now, that's just the skip_tick hook. Soon, it will include PNT.

The caller must belong to the same enclave as the destination. That's not a huge deal, but it will keep enclaves from mucking with one another.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I924956f41dbf8fe921ce000ae1bf9231e5df44fe
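A minimal sketch of what such a program could look like, assuming libbpf-style BPF C. The section name, context layout, and return-code convention are assumptions; only bpf_ghost_wake_agent(), BPF_PROG_TYPE_GHOST_SCHED, and the bpf_ghost_sched context (named in a later commit) come from the commit text:

    /* Sketch: a BPF_PROG_TYPE_GHOST_SCHED program on the skip_tick
     * hook waking the local agent. SEC name and ctx type are assumed. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("ghost_sched/skip_tick")
    int skip_tick(struct bpf_ghost_sched *ctx)
    {
        /* Wake the agent on this cpu; the kernel rejects the call if
         * the calling cpu and target cpu are in different enclaves. */
        bpf_ghost_wake_agent(bpf_get_smp_processor_id());
        return 0;
    }

    char _license[] SEC("license") = "GPL";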
This program runs whenever ghost has nothing to do: the agent is not runnable, there is no latched task, and there is no commit to do. The program returns 1 when it thinks PNT should retry its loop, such as if it woke the agent or latched a task.

To prevent BPF from causing an infinite loop (consider a program that always returns 1), we only run it at most once per global pick_next_task().

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: If409554e14df794fb75eba5beea0f596fb7b3522
This is a helper that bpf programs can call. It attempts to latch gtid to run on the calling cpu. It is similar to the old ghost_run, but it doesn't support "idle" or "agent". Unlike bpf_ghost_wake_agent(), this helper can only be called from the PNT attach point - we can extend it to other "trusted" attach points in the future.

The helper calls ghost_run_gtid(), which grabs an RQ lock, so we need to be sure that the RQ lock is not currently held by whoever is calling the BPF program. (The RQ lock is held during the tick programs.)

One thing to note: this calls ghost_set_pnt_state(), which may replace a latched task. Our caller was run from PNT and held the RQ, but released it. In that time, another cpu could have committed the TXN on our RQ. We could have ghost_run_gtid() abort if it sees a latched task, thereby picking the TXN over the BPF program's answer. Either way, some task will run and the other will be preempted. I opted not to "abort on latch", since I expect we may call this helper from a BPF function on wakeup, and in those situations the agent (via BPF) may want to preempt a latched task.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I9e45d0ee262a51f535475152c9e58745c577257c
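A sketch tying this helper to the PNT program from the previous commit: pop a runnable gtid from a queue map and try to latch it on this cpu. The map, section name, single-argument helper signature, and return conventions are assumptions:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_QUEUE);
        __uint(max_entries, 1024);
        __uint(value_size, sizeof(__s64));
    } runnable SEC(".maps");

    SEC("ghost_sched/pnt")
    int ghost_pnt(struct bpf_ghost_sched *ctx)
    {
        __s64 gtid;

        if (bpf_map_pop_elem(&runnable, &gtid))
            return 0;        /* nothing runnable: let PNT go idle */
        if (bpf_ghost_run_gtid(gtid))
            return 0;        /* latch failed (e.g. task departed) */
        return 1;            /* latched a task: retry the PNT loop */
    }

    char _license[] SEC("license") = "GPL";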
It is possible to reorder TASK_NEW and TASK_AFFINITY_CHANGED msgs
when an oncpu task executing sched_setaffinity() is switched into
ghost. This violates the assumption that TASK_NEW is the first msg
produced and confuses the agent.
The kernel defers producing a TASK_NEW msg when a running task switches
into ghost until the task schedules (the ghost set_curr_task handler
forces the issue by setting NEED_RESCHED on the task). The expectation
is that the task will schedule immediately and TASK_NEW is produced via
ghost_prepare_task_switch().
In some cases, however, this assumption is broken: e.g. when an oncpu
task is moved to the ghost sched_class while it is in the kernel doing
sched_setaffinity().
Initial conditions: task p is running on cpu-x in the cfs sched_class.
      cpu-x                              cpu-y

T0                                 sched_setscheduler(p, ghost)
                                   task_rq_lock(p)

T1    sched_setaffinity(p, new_mask)
      spinning on task_rq_lock(p)
      held by cpu-y.

T2                                 p->sched_class = ghost_sched_class

T3                                 p->ghost.new_task = true via
                                   switched_to_ghost(). MSG_TASK_NEW
                                   deferred until 'p' gets offcpu.

T4                                 set_tsk_need_resched(curr) via
                                   set_curr_task_ghost() to get 'p'
                                   offcpu.

T5                                 task_rq_unlock(p) before returning
                                   from sched_setscheduler().

T6    ... acquire task_rq_lock(p)
      p->allowed_cpus = new_mask

T7    produce TASK_AFFINITY_CHANGED msg
      via set_cpus_allowed_ghost() while
      the TASK_NEW msg is still deferred.
Tested: //third_party/ghost/api_test (cl/384364382)
Effort: sched/ghost
Change-Id: I539718257123a115d2fcbace0f34b23263d2c5fa
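A minimal sketch of the fix this commit implies, assuming the deferred TASK_NEW is tracked via the p->ghost.new_task flag shown at T3 (the function body and helper name are assumptions):

    /* Sketch: suppress TASK_AFFINITY_CHANGED while TASK_NEW is still
     * deferred, so TASK_NEW stays the first msg the agent sees. */
    static void set_cpus_allowed_ghost(struct task_struct *p,
                                       const struct cpumask *newmask)
    {
        if (p->ghost.new_task)
            return;    /* TASK_NEW not produced yet; stay silent */

        task_deliver_msg_affinity_changed(p);   /* name assumed */
    }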
It is possible to reorder TASK_NEW and TASK_DEPARTED msgs
when an oncpu task executing sched_setscheduler(cfs) is switched
into ghost. This violates the assumption that TASK_NEW is the
first msg produced and confuses the agent.
The kernel defers producing a TASK_NEW msg when a running task switches
into ghost until the task schedules (the ghost set_curr_task handler
forces the issue by setting NEED_RESCHED on the task). The expectation
is that the task will schedule immediately and TASK_NEW is produced via
ghost_prepare_task_switch().
In some cases, however, this assumption is broken: e.g. when an oncpu
task is moved to the ghost sched_class while it is in the kernel doing
sched_setscheduler(cfs).
Initial conditions: task p is running on cpu-x in the cfs sched_class.
      cpu-x                              cpu-y

T0                                 sched_setscheduler(p, ghost)
                                   task_rq_lock(p)

T1    sched_setscheduler(p, cfs)
      spinning on task_rq_lock(p)
      held by cpu-y.

T2                                 p->sched_class = ghost_sched_class

T3                                 p->ghost.new_task = true via
                                   switched_to_ghost(). MSG_TASK_NEW
                                   deferred until 'p' gets offcpu.

T4                                 set_tsk_need_resched(curr) via
                                   set_curr_task_ghost() to get 'p'
                                   offcpu.

T5                                 task_rq_unlock(p) before returning
                                   from sched_setscheduler().

T6    ... acquire task_rq_lock(p)
      p->sched_class = cfs_sched_class

T7    produce TASK_DEPARTED msg via
      switched_from_ghost() while the
      TASK_NEW msg is still deferred.
Tested: //third_party/ghost/api_test (cl/384364382)
Effort: sched/ghost
Change-Id: I15b44a04ddab399509ccfa67503ce4813bc5c5e2
Before "a9a7f79: check need_resched with rq->lock held before doing switchto." it was possible for transaction commit to race with a task entering context_switch due to switchto. After a9a7f79 the race is no longer possible so we can simplify the 'latched_task' preemption logic in ghost_prepare_task_switch(). Tested: - kokonut test //sched:ghost_smoketest Indus http://sponge2/1632130f-df7c-4aa0-9141-31f810d9ad30 (one known failure tracked in b/192287338) Arcadia http://sponge2/e4045c4a-4c0e-4b72-ada0-4a9d4fec15bf - agent_muppet && switchto_test Change-Id: I7a9879cf50e624de9092acea18c5ea4600adbc3e Effort: sched/ghost
This WARN proved helpful while debugging an issue where a TASK_PREEMPT was being produced for a task that was in an active switchto chain.

Note that we should be producing exactly two messages for a switchto chain:
- TASK_SWITCHTO when the switchto chain begins.
- One of TASK_PREEMPT/BLOCKED/YIELD/DEPARTED when the chain is broken.

Producing the TASK_PREEMPT while the chain was still in progress violated this contract and caused the agent to CHECK fail.

Tested: Ran all unit tests in virtme to verify that the warning is not seen.
Change-Id: Ib187a13dff95ce63cede2e065214d9e9c2d4e5dc
Effort: sched/ghost
This is easily reproduced by running
//third_party/ghost/api_test --gunit_filter=ApiTest.SchedAffinityRace

The panic was due to dereferencing 'task->ghost.status_word' in task_barrier_inc() when delivering MSG_TASK_AFFINITY_CHANGED.

Initial condition: task 'p' is executing on CPU2 in ghost.

      CPU1                               CPU2

T0                                 ghost task 'p' oncpu in do_exit()
                                   but hasn't lost its identity.

T1    sched_setaffinity(p) is able
      to find 'p' and take a ref on
      its task_struct.

T2                                 'p' schedules for the last time
                                   and task_dead_ghost(p) releases
                                   resources in 'p->ghost' like the
                                   status_word.

T3    set_cpus_allowed_ptr_common()
      takes task_rq_lock(p) and calls
      set_cpus_allowed_ghost() which
      panics the kernel when trying to
      deliver the AFFINITY_CHANGED msg.

Tested: //third_party/ghost/api_test --gunit_filter=ApiTest.SchedAffinityRace in a loop for 10000 times.
Change-Id: I04d3e731992f7552b9f7a71754547f881704f1ae
Effort: sched/ghost
Suggested by: brho@
Tested: run all unit tests in virtme and verify warning is not produced
Change-Id: I2a4ae2daaefb422d7c6d95f1d4c8cf0b8ee4a1fd
Effort: sched/ghost
Only a task that is running can accumulate cputime: specifically, a task that is runnable but not running could not have accumulated cputime since the last time it got off the cpu.

While debugging cl/386301191 it became evident that dequeue_task_ghost() is routinely called for runnable-but-not-running tasks when migrating a task via ghost_move_task() during txn commit. In this case the call to update_curr_ghost() serves no purpose and is just overhead.

Fix this by calling update_curr_ghost() from dequeue_task_ghost() only when the task being dequeued was running, as sketched below.

Change-Id: Ic3421f250a8ffc4ccf16b54309c1589aaa9cacd7
Effort: sched/ghost
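A sketch of the fix, assuming the kernel's usual task_current() test; everything apart from the update_curr_ghost() condition is elided:

    static void dequeue_task_ghost(struct rq *rq, struct task_struct *p,
                                   int flags)
    {
        /* Only the running task has accumulated cputime since it last
         * got oncpu; skip the update for runnable-but-not-running
         * tasks dequeued by ghost_move_task() during txn commit. */
        if (task_current(rq, p))
            update_curr_ghost(rq);

        /* ... dequeue bookkeeping unchanged ... */
    }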
…ts a remote CPU

Currently, RTLA_ON_IDLE may only be used when an agent is yielding itself. However, there are scenarios where a global agent is scheduling remote CPUs and it wants the satellite agents on those remote CPUs to wake up when the remote CPUs go idle. Thus, this KCL relaxes this constraint on RTLA_ON_IDLE so that the global agent may use that flag when committing transactions that target remote CPUs.

Tested: Coming soon
Effort: sched/ghost
Change-Id: I05f23be82faf189addfbad69cec14a33b6c8713c
This helps schedghostidle determine if the latcher was BPF or not.

Tested: schedghostidle with BPF
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ibd18726d057eeb9396afbd27688477855106d14d
If the agent is using BPF-PNT, knowing that a task got latched simplifies the agent's state machine and accounting. Instead of inferring success by getting a preempt or block message, the agent can request a "latched" message.

The main benefit to the agent is that it can easily determine the state of a given Cpu, e.g. the TaskLatched handler sets cs->current. Once cs->current is set, that lets the rest of the scheduler interact with that Cpu as if it was latched by a normal transaction. For instance, it can tell if a cpu is available or not. Additionally, the agent knows when the task was latched, which helps with its accounting, and we do not need to have yet another case in all of the message handlers (e.g. "from_bpf", just like "from_switchto").

The agent could check its BPF maps to determine the outcome of a BPF-PNT, however this is outside the stream of messages for a task, which makes the agent's job difficult. (Specifically, if the agent is looking at a BPF map for the outcome of a successful transaction, it needs to be careful: it may already have handled a TaskDeparted for that task. The agent could track which BPF slot every task used, but that prescribes specific ways that the agent must use BPF, versus a "fire and forget" model for running tasks.)

It helps the agent's cpu accounting for us to always send TASK_LATCHED for a cpu after the previous task has left the cpu (e.g. preempt). However, if the BPF-latched task is preempted before it got on cpu, we might send the message for TaskLatched before the previous task's "got off cpu" message. To help the agent handle this scenario, we tell the agent this was a latched_preempt.

Also, that WARN_ON_ONCE we had can get hit if you have a latched task when the agent exits.

To be submitted in conjunction with cl/387829782.

Tested: edf_test
Change-Id: I95892b9985e88081884c2e75ef996880f07064a5
Signed-off-by: Barret Rhoden <brho@google.com>
This field tells the agent that the task never actually got on cpu and was preempted in the latched state.

Despite our attempts to maintain ordering of messages, it is possible to send latched preemption messages before the previous task got off cpu. I ran into this with BPF-PNT, but I have a hard time recreating the scenario. I think it was something like this:

01: task0 on_cpu in switchto
02: spurious PNT call, perhaps CFS briefly woke and migrated a task
03: since t0 was in switchto, we set must_resched, so we have to pick
    some other task to run.
04: there is no latched task, so BPF runs
05: BPF latches a task: task1. the RQ is unlocked briefly in here.
06: the agent wants to run task2. it issues a txn while the RQ is
    still unlocked.
07: task2 gets latched. to do so, we must unlatch task1. this sends
    TASK_LATCHED and TASK_PREEMPT for task1.
08: commit complete, unlock the RQ
09: we're still in PNT, it grabs the RQ lock and continues
10: task2 is latched and selected to run
11: context switch from task0 to task2, producing TASK_PREEMPT (with
    from_switchto) for task0.

The order of messages was:
- Latched task1
- Preempt task1
- Preempt task0

That's a little weird, since userspace thought that task0 was on cpu, yet it receives Latched for t1. We could attempt to send the preempt for t0 when we latch t1, however it turns out we easily send latched-preempts out of order. Consider an agent issuing several transactions for the same cpu:

01: task0 on_cpu in a non-preemptible region
02: agent commits/latches task1. sets need_resched.
03: before we run PNT, the agent sees the txn from t1 completed, and
    issues another.
04: agent latches task2. when doing so, it preempts task1
    (latched_task_preemption), sending TASK_PREEMPT for t1.
05: PNT runs, ghost_produce_prev_msgs sets check_preempt_prev
06: PNT picks task2, since it was latched
07: context switch from t0 to t2, send TASK_PREEMPT for t0.

The order of messages was:
- Preempt task1
- Preempt task0

Even though task0 was on cpu. The agent can handle this, since it knows the transactions completed, and it can adjust cs->current and cs->next. But from looking at the messages, you can't easily construct the cpu's state. The agent had to use external information: the success of the transactions.

If the agent requested SEND_TASK_LATCHED, the order of messages would be:
- Latched task0 (from before step 01)
- Latched task1
- Preempt task1
- Preempt task0

Madness! We can't send the messages for task0 when we latch task1, since we don't know yet if task0 will block or yield or be preempted, so we'll have to live with this.

To be submitted in conjunction with cl/387829782.

Tested: agent + switchto_test and simple_exp
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I9f2a62eff021eafdaf50763b36e5883a5cae397b
In a couple places, we've relied on the fact that we had a single message channel in the global agent, forcing an ordering of messages for a given cpu, such that we handle the messages in the order they were sent. That is no longer true, as we move to having multiple channels with tasks sharded among the channels. Messages for a task are always sent in order to a particular channel, but tasks running on the same cpu might not use the same channel.

To allow the agent to manage cpu state when receiving out-of-order messages pertaining to a cpu, we now send the cpu_seqnum. The cpu_seqnum is a per-cpu "history" counter included with all messages that contain cpu state.

This state depends on the message. For TASK_LATCHED, it is that the task is latched on that cpu. A TASK_PREEMPT for task0 on cpuA means "cpuA has no ghost task running" for the instant when we sent the preempt. That state is true when we send the message. But right after the message was sent, we could have a ghost task on cpu: perhaps by a completed transaction, which does not involve a message.

The agent can use the cpu_seqnum to discard old cpu state information. This requires that all cpu state be determined by the current message. It'd be a pain to recreate state by handling *all* messages. Additionally, we'll eventually need to handle losing messages, so we don't want to rely on reconstructing state from out-of-order messages, since some messages may never arrive.

To be submitted in conjunction with cl/387829782.

Tested: edf_test, agent + switchto_test. No changes in userspace yet.
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I088cf6de2e7ade84f29cd53b05e1854dd01190a4
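A hypothetical payload sketch of what this adds; the exact field set of ghost messages is not shown in the commit, so everything except cpu_seqnum is an assumption:

    /* Sketch: cpu-state-bearing messages carry the per-cpu history
     * counter, so the agent can discard stale cpu state. */
    struct ghost_msg_payload_task_preempt {
        __u64 gtid;
        __u32 cpu;           /* cpu whose state this message describes */
        __u64 cpu_seqnum;    /* per-cpu "history" counter */
        __u8  from_switchto;
    };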
cpu_seqnum tells the user which moment in time we were at when we committed the transaction.

Agents which use async transactions (i.e. don't spin until the kernel commits it) treat the completion of the transaction as an implicit TASK_LATCHED message. By adding cpu_seqnum to the transaction, we correctly order this implicit message with the real cpu messages.

It is safe to increment the cpu_seqnum independently of the message stream: unlike our current task_barrier, the seqnum does not have a 1:1 relationship with a message. If we wanted to, we could increment that counter at arbitrary points.

When a global agent handles out-of-order messages and there is a pending transaction, it is difficult to determine what to set cs->current to. The agent only updates cpu state when it gets a message with a newer cpu_seqnum. However, if there is a pending transaction (CpuState cs->next != NULL) it must SyncCpuState at some point. This function checks to see if a transaction is complete, sets cs->current = cs->next, and sets cs->next->state = ON_CPU. The agent wants to sync before handling messages for cs->next; e.g. if we get TaskBlocked for next, we want to set its state to ON_CPU first. (Arguably, we could ignore this.)

Since we may handle messages out of order, and only the most recent message should mess with CpuState, we need to Sync in any message, even messages for newer tasks on that cpu. Consider this (messages and cpu_seqnum are in parens):

01: task0 is already running
02: agent decides to preempt task0 with task1
03: task0 blocks (B0)
04: cpu has no latched task, runs BPF
05: BPF latches task2
06: task2 runs (L1 - sent when t2 is on_cpu)
07: agent commits txn to run task1
08: task2 is preempted (P3, on ctx_switch, since it was on_cpu)
09: task1 blocks (B4)
10: BPF latches and runs task3 (L5)

There are other similar scenarios, such as if task2 is latch-preempted, or if you change 6,7,8 -> 7,6,8 (t2 doesn't run, but we get a TaskLatched).

Due to out-of-order messages, the only constraint on messages is that the agent receives L1 before P3, since they both belong to the same task. The agent will use the cpu_seqnum to know which is the most recent version of history. Additionally, when the agent checks messages, it may be at any time after step 7. Let's say all steps completed.

Say we receive L1 first. When we handle it, cs->next is set. If messages were in order, we could wait until we get B4 to sync state, since B4 pertains to task1 and is the message that must happen after the commit. And we need to set cs->current at some point: if we don't, we may have a task on_cpu but current is not set. But since messages are not in order, if we don't Sync now (when we get L1), we might not have another opportunity to Sync, because B4 could be older in history. In the case here, B4 is *newer*, so we will be able to muck with cs->current. But when we're at L1, we don't know that yet. We could actually be handling L5 - there's no good way of knowing. All we know is that we're more recent than the previous cs->cpu_seqnum.

The root of the issue is that when we SyncCpuState, we are inferring a TASK_LATCHED message: we know a commit succeeded, so we set cs->current. If we asked for a TASK_LATCHED with the commit (which is what BPF does), then we would be able to just set cs->current in the TaskLatched() handler, and *not* do it in SyncCpuState (i.e. have Sync only reap the commit, but not muck with *cpu* state). But since we don't want to get TASK_LATCHED messages all the time, we infer that the latching happened.

The solution is to treat the completion of the transaction as a change in the cpu's history: increment and report the cpu_seqnum. That allows userspace to both reap the transaction as well as (optionally) change its CpuState, if the transaction represents a newer version of history.

To be submitted in conjunction with cl/387829782.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I0d7988ba7a272e96841974d7886e13e975888886
TASK_DEPARTED has a cpu parameter, but if the task was not actually running, then the agent can't infer anything about the cpu. In particular, it does not know if a switchto chain ended (even if from_switchto is not set) or if there is another ghost task current.

In the new world of out-of-order messages, the main thing about from_switchto is for the *task*, not for the cpu. It's the kernel's way of saying "you probably weren't expecting this task to run, since you thought it was blocked, but it woke up, ran, and then blocked/preempted/departed". Since from_switchto is pertinent only for the task, we can't use cpu state to help interpret it. (Also, was_current will always be true for every message other than departed, where it is occasionally false.)

But from_switchto no longer conveys *useful* information about the cpu itself. It does tell us at that point in cpu_seqnum history that we left switchto. But when we handle a message with from_switchto=true in the agent, we might already have handled a *newer* message (one with from_switchto=false) and we left ST in response to that newer message (it's an "implicitly left ST", discussed below). Alternatively, we might not have received the original TaskSwitchto message yet.

Keep in mind this rule: the agent can only use the most recent cpu_seqnum to adjust cs->current or cs->in_switchto. So if we receive a message that is newer than any others, then we can adjust in_switchto. For almost every message, we'll leave the ST chain: basically any message that means "X is or was running on this cpu" (latched, blocked, preempted, departed(was_current=true)). I think of these as implicitly leaving ST.

The important part is that whether or not we leave ST is independent of whether the payload's from_switchto was set. We leave ST as soon as we get a message that implied the ST chain ended. Later on, we might receive an older message with from_switchto set: but we already set cs->in_switchto=false. We might also receive an older TaskSwitchto message. But it's old, so we ignore it: we already left.

Here's another scenario that shows why we can't use from_switchto to adjust cs->in_switchto:

... ST1(task0), B2(task1,from_st=true), L3(task2), ST4(task2), ...

Those are the messages in the order they were sent: we enter ST, then block with from_st=true. Then BPF latches a new task, then that task STs. We could receive those messages out of order (L3 and ST4 are from the same task, so they are in order). Possible order of handling: ST1, L3, ST4, B2(from_switchto=true). When we handle B2, we can't touch cs->in_switchto, since 2 < 4. Otherwise, we'll falsely think we're no longer in a ST chain. The ST chain ended with L3. All the agent knew was that we were in an ST chain, then suddenly we latched something. The agent knows it'll eventually get some message with from_switchto set, but due to the above scenario, it can't wait until it gets B2 to adjust in_switchto, since we might be back in a ST chain.

Also: I was slightly worried that the was_current check wouldn't handle the case where the departed task was latched, but was_current=false. However, in that case, the kernel will send a preemption before the departed(was_current=false). So it's OK.

Also also: Neel pointed out that the agent will be able to check task->on_cpu() to determine if it was current or not. This flag makes it a little easier on the agent, but is essentially a double-check. Since the task's messages are delivered in order, the agent would know a task latched (either via MSG_TASK_LATCHED or by completing a TXN) before it handled MSG_TASK_DEPARTED.

To be submitted in conjunction with cl/387829782.

Tested: agent + switchto_test and simple_exp
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I398a6f2e4b315d78864d8a85e74f7c57f780f415
Change-Id: I67d6f61b521238f774c710d1232b5978b12d7eb3
Prior to this change 'txn->commit_flags' was used to communicate singular, mutually exclusive values:
- greedy commit (commit_flags = 0)
- inline commit (commit_flags = COMMIT_AT_TXN_COMMIT)
- PNT commit (commit_flags = COMMIT_AT_SCHEDULE)

This made it impossible to use 'commit_flags' to communicate any other information to the kernel. For example, ALLOW_TASK_ONCPU is communicated via 'run_flags' even though 'commit_flags' would be a better choice (ALLOW_TASK_ONCPU is only relevant at commit time, as opposed to when the task is running).

Fix this by interpreting 'commit_flags' to contain individual flag values as opposed to a singular enumeration value. Note that there is no change in ABI.

Tested: all unit tests pass in virtme
Effort: sched/ghost
Change-Id: I0fa869fce9ca30870c29f6637e92fd9edcf28fae
The effect of ALLOW_TASK_ONCPU is limited to when a transaction is committed, so it goes into txn->commit_flags (see the sketch below).

Submitted in conjunction with cl/388747781

Tested: verified all unit tests in virtme.
Effort: sched/ghost
Change-Id: I5a8fb2e9c964d61527252f355b010730f419aaf4
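A sketch of the resulting flag interpretation across these two commits; the bit values are illustrative assumptions (the commits note the ABI itself did not change):

    #define COMMIT_AT_TXN_COMMIT  (1 << 0)   /* inline commit */
    #define COMMIT_AT_SCHEDULE    (1 << 1)   /* commit from PNT */
    #define ALLOW_TASK_ONCPU      (1 << 2)   /* moved from run_flags */

    /* greedy commit: txn->commit_flags == 0; flags now combine freely,
     * e.g. COMMIT_AT_SCHEDULE | ALLOW_TASK_ONCPU. */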
Advertising a departing task as runnable invites a race where the agent can try to schedule the task before it handles the TASK_DEPARTED msg. In this case the commit fails with a GHOST_TXN_INVALID_TARGET error, which is a fatal error in the agent (b/195081642 has more details).

Forcing 'runnable=false' in this situation leads to a couple of discrepancies:
- 'task_new->runnable' is not coherent with the GHOST_SW_TASK_RUNNABLE flag in the task's status_word (but ultimately fine since this is indistinguishable from a blocked task that woke up while the agent was handling the TASK_NEW).
- 'task_new->runnable' is not coherent with 'departed->was_current', but this should be okay since 'was_runnable' deals with task state whereas 'was_current' deals with cpu state.

Tested:
- 10000 iterations of api_test --gunit_filter=ApiTest.DepartedRace
- kokonut test //sched:ghost_smoketest http://sponge2/a5f65f0e-68e4-4152-94c8-e1d6628bc42c

Effort: sched/ghost
Google-Bug-Id: 195081642
Change-Id: I89ebea15961bd70c2899f2c324d3e15be6fc49d3
Replace the ghost_switchto_disable sysctl with a "switchto_disabled" enclave tunable. This tunable has the same meaning as the sysctl and is accessible via ghostfs (similar to runnable_timeout).

The associated test is modified in cl/387863829. The presubmit failure will go away once gtests is updated to that CL.

Tested: //prodkernel/tests/switchto_tunable_test (cl/387863829) in virtme
Effort: sched/ghost
Google-Bug-Id: 195752832
Change-Id: I71c672d0bd07d0e06694b623610d2d05b8b02292
There's no way to prevent ghost client tasks from grabbing the kernfs_mutex. It's used in sysfs and other places. If an agent task, responsible for scheduling tasks, ever attempts to grab the kernfs_mutex, there's a chance we'll deadlock. Specifically, the client can hold the mutex and be descheduled. If the agent blocks on acquiring the mutex, it will never schedule the client.

In general, ghostfs operations on open FDs don't touch the kernfs_mutex - there might be corner cases I didn't see. But we definitely grab it when opening paths (walking or lookups). The agent doesn't currently open any files from ghostfs from agent tasks, so this warning hasn't fired yet. But at least we'll catch it if it does, and avoid a potential source of deadlock.

Tested: edf_test, enclave_test, api_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I38422580ccc6b098fd2ee556d83e09f919a467cb
This adds an extra field, 'type', to the ghost_timer_fd structure. This allows the agent to distinguish between different types of timers that it may queue. The cookie field in the timer_fd then acts as an argument to be interpreted based on the type of timer.

Paired with cl/414055464.

Tested: In combo with the userspace changes.
Effort: sched/ghost
Change-Id: I80bbded859b338123b0873fdb91bf554034113f4
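A sketch of the extended structure; only 'cookie' and the new 'type' field are named in the commit, so the layout and field types are assumptions:

    struct ghost_timer_fd {
        __u32 type;      /* new: which kind of timer the agent queued */
        __u64 cookie;    /* interpreted by the agent according to 'type' */
    };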
In preparation for merging into sched/next, get rid of some gratuitous diffs.

Tested: all unit tests pass in virtme.
Effort: sched/ghost
Change-Id: I87885570714ee0709984ce657c96a0a1655e20e3
gbuild -s ARCH=arm64

Tested: amd64 builds
Effort: sched/ghost
Change-Id: I66b1f122a89fddb41c1065f7e7b16b3357b8028d
Adding the 112 byte struct in the middle of 'struct rq' perturbed the cache locality of the following members, which exhibited as increased cputime in the SwitchFutex benchmark (especially when the threads are not in the same LLC domain on AMD cpus).

Effort: sched/ghost
Change-Id: If17714649c9d351e85aa0e759f57b64d29b6adc4
Currently, whether or not prev is allowed to be selected by PNT is controlled by rq->ghost.must_resched, which is protected by the RQ lock. Setting must_resched means that the next time PNT runs, prev will be preempted by something - even if it is the idle task. (When prev is not an agent.)

An agent can request that prev be unscheduled by issuing a NULL transaction. This grabs the RQ lock. We will need to preempt tasks from BPF-MSG, which already holds an RQ lock, so we need to set must_resched, or something similar, locklessly.

The lockless nature of this is tricky - if we're in PNT, already ran BPF-PNT, and then the agent tries to resched the cpu, do we want the resched to take effect or not? Was the agent referring to prev (who is different than the latched task)? Or referring to the task they just latched?

This sounds a lot like why we have barriers. The agent wants to change state based on its understanding of the system. Only apply that change if the agent's understanding matches reality. We already have the cpu_seqnum: the history/state of a given cpu, exposed in all relevant messages. Even better, unlike the barriers, userspace has no assumptions about being told about every increment to cpu_seqnum. We can increment it whenever any cpu state changes, and we don't have to send a message.

So when the agent (via bpf, in an upcoming commit) asks us to resched, it can tell us the cpu_seqnum it got from its last TASK_LATCHED. As a side note, this means that agents should *not* use the return value of bpf_ghost_run_gtid() to mean "it's ok to resched right now!", since they don't know the cpu_seqnum. Even if we returned that value, as soon as we send TASK_LATCHED, we increment again (it's another change in cpu state, perhaps better renamed as TASK_ON_CPU, but that's a separate issue). The agent should only resched based on the latched->cpu_seqnum from when prev got on cpu.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ie9edaab22ae2f77078ca2b887fe72d4d528b2e80
This is a bpf helper, callable from any enclave BPF program (i.e. BPF-PNT, BPF-MSG, or skip-tick). It will force the cpu to reschedule, such that "prev" will not be picked again. Agents can use this to preempt a task. When the cpu reschedules, assuming there is no latched task, BPF-PNT will run so that the agent can pick another task to latch.

A few notes:
- You can call this from within BPF-PNT. If you call it on your own cpu, it will have no effect, mostly, similar to setting need_resched during pick_next_task(). The technical reason it won't work is that the kernel calls clear_preempt_need_resched() right after pick_next_task(), so any "need_resched" set during PNT will be a noop. "Mostly", since you could write "cpu_seqnum + N" and have must_resched set at some arbitrary point in the future. (The kernel cleans this up during cpu/agent teardown.)
- check_same_enclave() works so long as you are in an rcu read critical section. We're a bpf helper, and our bpf programs are run under an rcu_read_lock(). Someone could remove the cpu from the enclave, but that cpu cannot be handed out to another enclave until after a grace period. This is the case for all of the ghost bpf helpers.
- I think you don't need an smp_wmb() between the WRITE_ONCE and the resched_cpu_unlocked() (which is a prodkernel function). That has a CAS in it, and in general, I'd expect sane behavior between the ordering of writes and IPIs in Linux. Though that might not be sane.

To be submitted with cl/422883910.

Tested: edf_test, using the helper in bpf-msg.
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I2746e6c789010c3a1b0014483d3d2b788723149f
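A hedged sketch of using the helper from a BPF-MSG program, as suggested in the previous commit: preempt a cpu using the cpu_seqnum recorded from its last TASK_LATCHED. The map, message-type constant, payload field names, and two-argument helper signature are assumptions:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1024);          /* one slot per cpu */
        __type(key, __u32);
        __type(value, __u64);
    } latched_seqnum SEC(".maps");

    SEC("ghost_msg")
    int handle_msg(struct bpf_ghost_msg *msg)
    {
        __u32 cpu = msg->cpu;
        __u64 *seq = bpf_map_lookup_elem(&latched_seqnum, &cpu);

        /* Only resched against the seqnum from the last TASK_LATCHED;
         * a stale seqnum makes the resched a no-op, by design. */
        if (seq && msg->type == MSG_TASK_WAKEUP)
            bpf_ghost_resched_cpu(cpu, *seq);
        return 0;
    }

    char _license[] SEC("license") = "GPL";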
Certain operations, particularly ghost bpf helpers, are operating on some enclave, but it might not be the enclave that owns the cpu. Consider an enclave on a set of cpus, and a task on another cpu enters the enclave. That generates a TASK_NEW, which triggers BPF-MSG. That all runs on a cpu that is not in the enclave.

Add a per-task *__target_enclave to track which enclave we are targeting with our functions. Users of the target_enclave can nest, such that if we are in the middle of some operation and get interrupted, the IRQ handler can set its own target_enclave so long as it restores it.

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I1cf7bc70bf2bda6d133155b9a8877ca531074587
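A sketch of the nesting discipline described above; the wrapper is hypothetical, but the save/restore pattern is what the commit requires of users (e.g. an IRQ handler):

    /* Run fn() with 'e' as the task's target enclave, restoring the
     * previous target afterwards so nesting (e.g. from IRQ) is safe. */
    static void with_target_enclave(struct enclave *e, void (*fn)(void))
    {
        struct enclave *saved = current->__target_enclave;

        current->__target_enclave = e;
        fn();
        current->__target_enclave = saved;
    }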
Directories before files, then alphabetical sorting. No capital letters.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ib9267e091342ea8a84c4e0e639fe3fae5a673fba
put_user() may fail at the end of ghost_associate_queue() even though dst_q was set successfully. Return the status directly instead of writing it to the ioctl parameter.

To be submitted with cl/427797430.

Effort: sched/ghost
Google-Bug-Id: 202070945
Change-Id: I9a72bde02afa556bf5c644be4bb80c98274c1b3a
Prior to this change a task with a poisoned rendezvous could produce a MSG_TASK_PREEMPT (e.g. if the next task picked to run on the cpu belongs to CFS). This is not intuitive since the overall sync_group has failed. It is also easy to miss since producing MSG_TASK_PREEMPT is the exception. For example, see cl/427576696, which fixes the flaky ApiTest.LocalAgentWakeup.

Effort: sched/ghost
Google-Bug-Id: 214648944
Change-Id: Ib13c38f60b8d1acbd292d4de7b8d7ff528f44c4f
At the moment ghOSt supports more than one enclave running concurrently on the host, as long as they all have the same ABI. Relax the check in _ghost_resolve_enclave() to reflect that, since it is possible for the task_reaper (which is a CFS task) to run on any cpu (including one that belongs to an enclave different than the one it is reaping).

Effort: sched/ghost
Google-Bug-Id: 218869667
Change-Id: I1b79d4468c343a5a837349ba6984f79bf1f1ec40
Instead of using syscalls, use ghostfs to move tasks to an enclave.

This cleans up core.c's setsched code. We had a weird dance of passing along a ctl_fd, then converting that to an enclave, but we had to put the enclave after unlocking the RQ. It wasn't pretty. Instead, if we enter the kernel through ghostfs, the enclave is already known, and its refcounts are managed for us. This is analogous to using ghostfs ioctls instead of raw syscalls: we do not need any magic to figure out which enclave we're using.

To that end, there are two mechanisms to enter ghost: one for regular tasks and another for agents.

To add a regular task to an enclave, write its pid into enclave_x/tasks. Write 0 for 'current'. The tasks file is world writable, so anyone can join an enclave, if they can find it. However, to move another task (pid != 0), you need CAP_SYS_NICE, just like with sys_sched_setscheduler(). You can read the tasks file to get a list of tasks in the enclave, though beware that if you have too many tasks, the file might be truncated (seq file overflow).

To add an agent to an enclave, write "become agent <QFD>" into ctl. You need write access to ctl, which an agent will have. The QFD is the FD of the queue you want for this agent, -1 for "default". qfd is the same parameter we used to pass in the sched_attr.

To be submitted with cl/424409182.

Tested: edf_test, enclave_test, api_test, transaction_test, manually moving tasks into an enclave from the shell.
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ib4ead6536354259ca1a464e62d72220c0465eef4
Now that there are no other users of ctlfd_to_enclave, we can clean up how bpf programs are linked.

- We know the ABI from the expected_attach_type, so we don't need to do the for_each_abi() check.
- We only need the struct fd and the kernfs kn until we incref the enclave. Simply returning a kreffed pointer simplifies the code.
- ghost_bpf_link_attach just needs to figure out the ABI. The rest of the guts can be in ghost.c. This means abi->bpf_attach and detach can be in ghost.c directly. Further, the ctlfd_to_enclave can be in ghost.c too. Future ABIs don't even need to use the ctlfd (though the old ghost_core.c didn't actually know what an enclave ctl is).
- We no longer do any PROG_TYPE checking in ghost_core.c. That makes it easier to change our types and attach points in the future, e.g. dropping skip_tick. We still have PROG_TYPE_GHOST in other places in the kernel, so it's not doable yet.

Tested: api_test, edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Iea57c6061047dcb0bf0c4bb2e031a86bbc62b689
enum bpf_func_id is created by a macro that is used for an array lookup from helper ID to string (func_id_str in disasm.c). We don't want that array to be too big. However, we also don't want our helper ID numbers to clash with upstream's function IDs. That will happen eventually.

Move the ghost bpf helper IDs out of bpf_func_id so that we can use a larger number without affecting the array. The array mapping is optional - it's used in func_id-to-name lookups, and those lookups are protected in case the func_id > __BPF_FUNC_MAX_ID.

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I5b221d1834f1afe11aed55f73216b9e3d8915f46
With the growing number of BPF helpers, we'll soon run out of space in the bpf_func_id space. Increase the base numbers for the PROG_TYPEs, the ATTACH_TYPEs, and the helpers:
- progs: 1000
- attach: 2000
- helpers: 3000

There is enough room for growth in upstream's numbers that we'll either be merged upstream or dead before we run into conflicts. We're still well under the 0xffff requirement for attach types (since the upper bits are the enclave ABI).

To be submitted with cl/424689285.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ie0600e0b4478d3289377e4a4d58c9e13b4a491aa
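The new bases, per the commit text (the macro names are assumptions):

    #define GHOST_BPF_PROG_TYPE_BASE    1000
    #define GHOST_BPF_ATTACH_TYPE_BASE  2000
    #define GHOST_BPF_HELPER_BASE       3000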
This lets us do abi-specific parsing of the create command. Specifically, we can add abi-specific commands after the ABI part of "create $ID $ABI".

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: If9a8d678ec8a3072c83768672c57523a04cb3d2e
Origin-9xx-SHA1: b549601a7b45001f48237968d9ba3c8b1cfc56a6
Rebase-Tested-11xx: compiled
Add an optional parameter to "create" for a GID of the new enclave. Default is root. Use the caller's EUID as the UID for the enclave. This parameter to create is ABI-dependent, since it is after version parsing, meaning we can change it easily.

In an effort to run the agent as someone other than root, we'd like to chown the files in the enclave_dir to whoever will run there. Prior to this commit, all ghostfs files are owned by root:root. Additionally, we'd like to restrict the agent from changing the cpus. We can chmod g-w, so long as the agent is not the owner. Putting these together, the agent can use the group access bits, and whoever creates the enclave (some trusted daemon) can be the owner. The reason to make the creator the owner is so that they can chmod cpulist and cpumask.

Example: enclave ID = 565, ABI = 65, desired GID = 716

(root):/sys/fs/ghost# echo create 565 65 716 > ctl
(root):/sys/fs/ghost# echo 0-10 > enclave_565/cpulist
(root):/sys/fs/ghost# chmod g-w enclave_565/cpulist
(root):/sys/fs/ghost# chmod g-w enclave_565/cpumask
(root):/sys/fs/ghost# ls -la enclave_565/
total 0
dr-xr-xr-x 3 root spaceghostd      0 Feb  3 13:02 .
dr-xr-xr-x 3 root root             0 Feb  3 12:49 ..
-r--r--r-- 1 root spaceghostd      0 Feb  3 13:02 abi_version
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 agent_online
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 commit_at_tick
-rw-rw---- 1 root spaceghostd 458752 Feb  3 13:02 cpu_data
-rw-r--r-- 1 root spaceghostd      0 Feb  3 13:09 cpulist
-rw-r--r-- 1 root spaceghostd      0 Feb  3 13:02 cpumask
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 ctl
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 runnable_timeout
-r--r--r-- 1 root spaceghostd      0 Feb  3 13:02 status
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 switchto_disabled
dr-xr-xr-x 2 root spaceghostd      0 Feb  3 13:02 sw_regions
-rw-rw-rw- 1 root spaceghostd      0 Feb  3 13:02 tasks
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 wake_on_waker_cpu

(spaceghostd):/sys/fs/ghost/enclave_565$ chmod g+w cpulist
chmod: changing permissions of ‘cpulist’: Operation not permitted
(spaceghostd):/sys/fs/ghost/enclave_565$ cat cpulist
0-10

(spaceghostd):/tmp/$ ./agent_muppet --enclave /sys/fs/ghost/enclave_565/

That actually fails due to missing CAP_BPF, but the ghostfs stuff worked.

Note that to create an enclave, you need write access to ghostfs/ctl, which is owned root:root.

Tested: creating an enclave and attaching to it, edf_test, agent_muppet
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I93a6cca6c29e7a318460ef599c0b269d516ec4f2
Origin-9xx-SHA1: 799b9eb05be77331ec9b64eb9abe8dcc2b40182c
Rebase-Tested-11xx: compiled
It should come as no surprise that calling the old ghost_destroy_enclave() directly can cause issues. We already had infrastructure to destroy the enclave with a work_struct, which helped when destroying the enclave from tricky sched code (IRQ context). It turns out that we also need the work_struct to handle destroying the enclave from ghostfs!

Why? Consider the case where the task destroying the enclave is in the enclave, e.g. bash joined the enclave. We'll hang after the synchronize_rcu() call. Essentially, we're still in the enclave, the agents were killed, and we need someone to schedule us. We haven't reached the part of the code that kicks all the enclave tasks back to CFS.

The cleanest thing is to just defer the work to CFS, which we already do when we destroy the enclave from IRQ context. Now anyone can call ghost_destroy_enclave() without worrying about their context or other race conditions. Even an agent task should be able to do it.

Tested: edf_test, moved bash into an enclave and destroyed via ghostfs
Effort: sched/ghost
Google-Bug-Id: 216664048
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ief60ecfec4df05bab7d157f60e04d5e5f42a25bd
Origin-9xx-SHA1: 28e5f4621e10640f34f6ac8b885239c4f51d0c51
Rebase-Tested-11xx: compiled
Skip-tick was our first BPF use case. Anything you could do with skip-tick can be done with BPF-MSG. However, not all agents want to use BPF, particularly for something as trivial as suppressing all cpu tick messages.

Add an enclave tunable, "deliver_ticks", which defaults to 0. When set, we'll deliver cpu tick messages during the ghost timer tick. At this point, BPF-MSG can intercept the message, handle it, and optionally suppress it - just like any other message.

To be submitted with cl/427258783.

Tested: edf_test, agent_exp with/without --ticks, agent_muppet + simple_exp, api_test
Effort: sched/ghost
Google-Bug-Id: 210860596
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I8beae844f241b94d8e80dbdbb12660e536bda607
Origin-9xx-SHA1: 4ee077db4d91c8cc3307188e953e0b336cf5eebf
Rebase-Tested-11xx: compiled
GHOST_BPF and bpf_ghost_sched were in the main uapi bpf.h, but not the tools/ version of the header. It's not as simple as copying the entire bpf.h header from the kernel to tools/, since there are a bunch of other differences in non-ghost code, mostly in comments.

Tested: //net/bpf:bpf_test_suite:test_progs
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ica002dabf68fba633fc511b6a4130a809256f552
Origin-9xx-SHA1: 9321a72de9b96bd566cacc5867e2799507aa01f4
Rebase-Tested-11xx: compiled
…grams

Once we've inserted the bpf programs (attached or linked), we no longer need to load or attach/link more programs. In the off chance our agent is compromised, disabling bpf_prog_load means the agent would be unable to load new hostile programs that it could attach to other places in the kernel.

Our agents typically run with CAP_BPF, though not necessarily CAP_PERFMON, so the list of bpf programs we could load and attach is limited. This helper limits it further. The existing programs we already loaded are considered "known good". They cannot be attached to random places in the kernel: only to BPF-PNT and BPF-MSG.

This helper is a one-way operation for the calling process. A new agent that attaches to the enclave will be able to load and attach new programs. Note you cannot have more than one program attached per enclave, per attach point.

To be submitted with cl/429130082.

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I4895e78b0c7e49006f16dddb7341f109a3bd006b
Origin-9xx-SHA1: 49667d54efeb559e9dd375f68ff88da0524e0f3b
Rebase-Tested-11xx: compiled
No callers were checking the return value.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I2bbb912c61b11358d2cf8dd3593c689f647415c2
Origin-9xx-SHA1: 8bb13a445545f9cdd0dac7eab761311e781bd056
Rebase-Tested-11xx: compiled
Previously, BPF-PNT ran only as a last resort, after several checks:
1) if there is no runnable agent (not blocked_in_run)
2) if we will not yield to any other class between PNT-A and PNT-G, specifically to CFS (rq_adj_nr_running)
3) if there is no latched task
4) if the current task won't keep running (must_resched)
5) if there's no pending transaction (unlatched, but sitting in the txn region, perhaps with COMMIT_AT_SCHEDULE).

After all of that, we'd run BPF, then, since we unlocked, go back and recheck things like whether the agent woke, whether CFS woke, whether bpf latched a task, etc.

At the moment we unlocked the RQ to run BPF-PNT, cases 1, 2, and 3 could all become false, which we were able to handle. However, that's a rare race. By exposing the RQ status via the context struct bpf_ghost_sched, BPF-PNT can make the same decisions it did before: e.g. if ctx->agent_runnable, then don't bother latching a task.

The BPF program can detect cases 3 and 4 from ctx->next_gtid, but there is no nice way to detect case 5. The only real user was COMMIT_AT_SCHEDULE, which was added to speed up slow global agents and is largely unused. BPF-PNT solves the issues of COMMIT_AT_SCHEDULE, so I recommend not using both BPF-PNT and COMMIT_AT_SCHEDULE.

Running BPF later and redoing the checks was a minor inconvenience, but we were essentially deciding for BPF-PNT what it wanted to do, and presumed that there was nothing other than latching a task that it could do.

Submitted with cl/435677777.

Tested: api_test, edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I4ba012add422ee34c4d2a103b1e19e9bd95cd765
Origin-9xx-SHA1: dc74a30f80d9b03483ef95f5797ae9fd7d3609d2
Rebase-Tested-11xx: compiled
Fixed in commit 875ee64be8cd ("sched/ghost: fix bpf_helper enclave
discovery").
Tested: compiled
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I01ca180a51ec04e4adfcc135ebd1cf2e63fafbc3
Origin-9xx-SHA1: 79d2ede01b2e3c178f72daf32426ea29fe385d0a
Rebase-Tested-11xx: compiled
select_task_rq() did not always call select_task_rq_ghost(), and even if it did, it could ignore the answer - particularly if the cpu was not in the task's allowed_cpus.

The RQ doesn't matter much for ghost. The "real" RQ is in the agent, but the kernel still uses the (per-cpu) RQ for its own bookkeeping, including the steps needed to wake a task. For an in-kernel scheduler like CFS, the RQ matters: it's where the task will compete for cpu time and may ultimately run. For ghost, it's a temporary staging ground until a task gets latched (at which point it will change RQs).

Of note is the TTWU_QUEUE feature. Normally, when we do a 'remote' wakeup, meaning select_task_rq() picks a cpu other than the caller's, we'd grab their rq lock, do the wakeup, and maybe IPI if we need to preempt. With TTWU_QUEUE, we can put the task on the RQ's wake_list, then send an IPI (amortized). This trades an IPI for lock bouncing - presumably, if the task will actually run on its old cpu, then this will help.

The problem arises when we pick a rq that is neither the task_cpu (i.e. the previous cpu it ran on) nor the current cpu. This third-party cpu can get overloaded with resched IPIs from TTWU_QUEUE remote wakeups. CFS tries to find an idle RQ, and doesn't aggressively load balance or otherwise migrate tasks from a cpu. Ghost does neither. We offer a choice between task_cpu and the current cpu (waker's cpu). And we can yank tasks off remote RQs without hesitation: this is what happens during a latch.

select_task_rq_ghost() would normally be OK, but prior to this commit, select_task_rq() would ignore the result under certain circumstances, specifically if the cpu was not in p->allowed_cpus. The ghost kernel ignores affinity: that's up to the agent to enforce. Yet select_task_rq() was enforcing affinity, but only for the 'temporary' rq - not the actual execution of a task.

Consider a task that was affined to numa node 1, yet the agent ran it on node 0. It wakes on node 0, and select_task_rq_ghost() returns a cpu from node 0. select_fallback_rq() will pick the first cpu on node 1 to do the wakeup, and using TTWU_QUEUE, it will send an IPI, and that cpu will have to do all the wakeup bits. For ghost, that includes sending the wakeup message, which can run a bpf-msg program. Once the task is woken up, another cpu (perhaps from node 0) can latch it - essentially pulling it away from the victim cpu. If that task blocks and wakes quickly, and it was running on node 0, the cycle will continue.

If you have enough cpus running/blocking/waking tasks, and shipping their wakeup work to a victim cpu, then that cpu will get stuck in IRQ context, constantly handling new resched IPIs. You'd need enough cpus to keep the victim busy.

The aggravating factor to this is bpf-msg. bpf programs will finish in a bounded amount of time, but there's nothing that stops them from being inefficient. The world's simplest scheduler (my Biff scheduler in a forthcoming CL) is a global queue, implemented with a BPF_MAP_TYPE_QUEUE, which is protected by a kernel spinlock, which is an MCS queue lock. If you have all the cpus hammering that lock too, it will make bpf-msg less efficient, to the tune of 40-50us per task ttwu().

Even without bpf-msg, you'd need only enough cpus to keep the victim busy. If the victim spends W_c handling each wakeup, and the time for a task to block, wake, run, and reblock is T_c, you'd need T_c / W_c cpus to keep the victim stuck in its IPI handlers (for instance, with an illustrative T_c of 500us and W_c of 50us, ten such cpus would saturate the victim). A combination of bpf-msg and a really dumb scheduler increased W_c, shrinking that number.

The trick is to break the cycle: by requiring that select_task_rq() returns either task_cpu() or the current cpu, we avoid this "stuck in resched IPI" scenario:
- if it's the current cpu, we don't need an IPI.
- if it's task_cpu, then when some other cpu pulls it from the victim's RQ, that cpu becomes the new task_cpu, breaking the cycle.

Tested: SLL
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ib65a1672420c72ff9d3bf71c0d50eba245a1019d
Origin-9xx-SHA1: 37588684a0ce59fb763a4deb2fea31078e47984c
Rebase-Tested-11xx: compiled
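A sketch of the constraint under 5.11-era assumptions (the signature and predicate name are assumed; the commit only requires that the returned cpu be one of the two described above):

    /* In core.c's wakeup path: always honor ghost's pick, which is
     * either task_cpu(p) or the waker's cpu, so a third-party cpu can
     * never become a TTWU_QUEUE resched-IPI victim. */
    static inline int select_task_rq(struct task_struct *p, int cpu,
                                     int wake_flags)
    {
        if (ghost_policy(p->policy))    /* predicate name assumed */
            return p->sched_class->select_task_rq(p, cpu, wake_flags);

        /* ... existing path, including select_fallback_rq(), elided ... */
        return cpu;
    }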
The agent's bpf-msg program returns whether or not it wants us to send the message to its Channel. If it does not want the message, then we can free the status word directly in the kernel.

Normally agents free the status word after a dead or departed message. If there is no message, userspace will not free the status word.

Tested: edf_test, api_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I01f02e92581fe27405af1cae4345bb78b9aafc7b
Origin-9xx-SHA1: d673ad0e788d84be97a83b984367b84c64c83011
Rebase-Tested-11xx: compiled
"discover tasks" will generate a task_new for each task in the enclave. This is a mechanism for live update: an agent can discover tasks that were already in the enclave before it started handling messages. Userspace agents that inherit from BasicDispatchScheduler() have their own discovery mechanism based on the StatusWordTable: for each status word, associate the task with the agent's channel. We use association to make sure we run exactly one TaskNew() for each task, including tasks that arrive concurrently. For agents that do not have a substantial userspace presence, such as the biff scheduler (bpf-only), it's simpler to generate task_new messages. Your scheduler just needs to ignore task_new for tasks it already knows about. Tested: agent_biff handoff Effort: sched/ghost Signed-off-by: Barret Rhoden <brho@google.com> Change-Id: I70d796d11074b4cb7d638d8531cead9fe3ef0b2b Origin-9xx-SHA1: 9dbec5a13eba0a8cce16bc2582f0191afb26aaa5 Rebase-Tested-11xx: compiled
Add #define GHOST_BPF to scripts/bpf_helpers_doc.py so that the GHOST_BPF macro is generated into bpf_helper_defs.h, which makes it visible to common.bpf.h (see cl/460745992). This allows us to guard the magic numbers there with #ifndef GHOST_BPF.

Tested: pushed this change to a test branch on the github repo, edited the WORKSPACE file locally in the userspace repo, and was able to build on a GCE instance: google@eb2fd69
Effort: sched/ghost
Change-Id: I9edc790956aa92f4ca157180891c7fd6727a99d6
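A sketch of the guard this enables in common.bpf.h; the fallback constant follows the earlier renumbering commit, but the exact name and value are assumptions:

    /* With GHOST_BPF now emitted into bpf_helper_defs.h, userspace only
     * falls back to local magic numbers on kernels without ghost. */
    #ifndef GHOST_BPF
    #define BPF_PROG_TYPE_GHOST_SCHED  1000   /* value assumed */
    #endif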