
jackhumphries and others added 30 commits June 11, 2021 23:41
Change-Id: I6665a8dc6cb3ac0b11df2eafb9459f67e74315d1
Change-Id: I243e9d328aed1fef8629a713a5734677a4fd9556
Change-Id: I7bf91efc96dfaf513ebb08f9408f36c57e5d3dec
Change-Id: Ia7fcc08f30bc2fe1097e3ff8dd1b879568c9ffc8
Change-Id: I0bdedf1467048bef0bf83c2722dfb1026d6f8e3b
Merge conflict resolutions:
- kernel/sched/core.c: added out_return in pick_next_task

Change-Id: I1371554d7cecbf371fdb811b937e7d878e0b6c8b
In a future patch, BPF programs will be able to call this.  The check is
to ensure that the caller's cpu's enclave matches the target cpu's
enclave.  This ensures that BPF programs (and thus their agents) are
affecting their own enclaves.

Regarding get_cpu(), I noticed the old code was calling
smp_processor_id(), but it was in a preemptible path (via sys_ghost).
The WARN_ON caught it.  It seemed simplest to have the caller call get_cpu(),
instead of mucking with the various return paths in the __ helper.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I0ecf3902257280d86f737a7a7cbbcf201e392682
This is a BPF helper function that wakes an agent on a cpu.  It is
callable by any programs of BPF_PROG_TYPE_GHOST_SCHED.  Right now,
that's just the skip_tick hook.  Soon, it will include PNT.

The caller must belong to the same enclave as the destination.  That's
not a huge deal, but it will keep enclaves from mucking with one
another.
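
Roughly, a caller might look like the sketch below.  The section name,
context type, and helper IDs are assumptions, not part of this commit;
only bpf_ghost_wake_agent() and BPF_PROG_TYPE_GHOST_SCHED are.

#include <linux/types.h>

/* Helper IDs are assumed; see the later commit that moves ghost helper
 * IDs up to the 3000 range. */
static long (*bpf_ghost_wake_agent)(__u32 cpu) = (void *)3000;
static __u32 (*bpf_get_smp_processor_id)(void) = (void *)8;

__attribute__((section("ghost_sched/skip_tick"), used))
int wake_on_tick(void *ctx)
{
    /* Only succeeds if this cpu is in the caller's enclave. */
    bpf_ghost_wake_agent(bpf_get_smp_processor_id());
    return 0;
}

char _license[] __attribute__((section("license"), used)) = "GPL";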

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I924956f41dbf8fe921ce000ae1bf9231e5df44fe
This program runs whenever ghost has nothing to do: the agent is not
runnable, there is no latched task, and there is no commit to do.

The program returns 1 when it thinks PNT should retry its loop, such as
if it woke the agent or latched a task.

To prevent BPF from causing an infinite loop (consider a program that
always returns 1), we only run it at most once per global
pick_next_task().
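
A sketch of such a program; the section name, context type, and helper
IDs are assumptions:

#include <linux/types.h>

static long (*bpf_ghost_wake_agent)(__u32 cpu) = (void *)3000;
static __u32 (*bpf_get_smp_processor_id)(void) = (void *)8;

__attribute__((section("ghost_sched/pnt"), used))
int pnt_idle(void *ctx)
{
    if (bpf_ghost_wake_agent(bpf_get_smp_processor_id()) == 0)
        return 1;   /* woke the agent: PNT should retry its loop */
    return 0;       /* we run at most once per global pick_next_task() */
}

char _license[] __attribute__((section("license"), used)) = "GPL";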

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: If409554e14df794fb75eba5beea0f596fb7b3522
This is a helper that bpf programs can call.  It attempts to latch gtid
to run on the calling cpu.  It is similar to the old ghost_run, but it
doesn't support "idle" or "agent".

Unlike bpf_ghost_wake_agent(), this helper can only be called from the
PNT attach point - we can extend it to other "trusted" attach points in
the future.  The helper calls ghost_run_gtid(), which grabs an RQ lock,
so we need to be sure that the RQ lock is not currently held by whoever
is calling the BPF program.  (The RQ lock is held during the tick
programs).

One thing to note: this calls ghost_set_pnt_state(), which may replace a
latched task.  Our caller was run from PNT and held the RQ, but released
it.  In that time, another cpu could have committed the TXN on our RQ.

We could have ghost_run_gtid() abort if it sees a latched task, thereby
picking the TXN over the BPF program's answer.  Though either way, some
task will run and the other will be preempted.

I opted not to "abort on latch", since I expect we may call this helper
function from a BPF function on wakeup, and in those situations, the
agent (via BPF) may want to preempt a latched task.
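
A hedged sketch of a PNT program using the helper; the helper ID and
signature are assumptions, and pick_next_gtid() stands in for whatever
policy the agent implements (e.g. popping a BPF_MAP_TYPE_QUEUE):

#include <linux/types.h>

static long (*bpf_ghost_run_gtid)(__s64 gtid) = (void *)3001;

static __s64 pick_next_gtid(void)
{
    return 0;   /* placeholder policy: nothing to run */
}

__attribute__((section("ghost_sched/pnt"), used))
int pnt_latch(void *ctx)
{
    __s64 gtid = pick_next_gtid();

    if (!gtid)
        return 0;
    /* May replace a task latched by a txn committed while the RQ was
     * unlocked; we deliberately do not abort on latch. */
    if (bpf_ghost_run_gtid(gtid) == 0)
        return 1;   /* latched: retry the PNT loop */
    return 0;
}

char _license[] __attribute__((section("license"), used)) = "GPL";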

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I9e45d0ee262a51f535475152c9e58745c577257c
It is possible to reorder TASK_NEW and TASK_AFFINITY_CHANGED msgs
when an oncpu task executing sched_setaffinity() is switched into
ghost. This violates the assumption that TASK_NEW is the first msg
produced and confuses the agent.

The kernel defers producing a TASK_NEW msg when a running task switches
into ghost until the task schedules (the ghost set_curr_task handler
forces the issue by setting NEED_RESCHED on the task). The expectation
is that the task will schedule immediately and TASK_NEW is produced via
ghost_prepare_task_switch().

In some cases, however, this assumption is broken, e.g. when an oncpu
task is moved to the ghost sched_class while it is in the kernel doing
sched_setaffinity().

Initial conditions: task p is running on cpu-x in the cfs sched_class.

    cpu-x                               cpu-y
T0                                      sched_setscheduler(p, ghost)
                                        task_rq_lock(p)
T1  sched_setaffinity(p, new_mask)
    spinning on task_rq_lock(p)
    held by cpu-y.
T2                                      p->sched_class = ghost_sched_class

T3                                      p->ghost.new_task = true via
                                        switched_to_ghost(). MSG_TASK_NEW
                                        deferred until 'p' gets offcpu.

T4                                      set_tsk_need_resched(curr) via
                                        set_curr_task_ghost() to get 'p'
                                        offcpu.

T5                                      task_rq_unlock(p) before returning
                                        from sched_setscheduler().

T6  ... acquire task_rq_lock(p)
    p->allowed_cpus = new_mask

T7  produce TASK_AFFINITY_CHANGED msg
    via set_cpus_allowed_ghost() while
    the TASK_NEW msg is still deferred.

Tested: //third_party/ghost/api_test (cl/384364382)

Effort: sched/ghost
Change-Id: I539718257123a115d2fcbace0f34b23263d2c5fa
It is possible to reorder TASK_NEW and TASK_DEPARTED msgs
when an oncpu task executing sched_setscheduler(cfs) is switched
into ghost. This violates the assumption that TASK_NEW is the
first msg produced and confuses the agent.

The kernel defers producing a TASK_NEW msg when a running task switches
into ghost until the task schedules (the ghost set_curr_task handler
forces the issue by setting NEED_RESCHED on the task). The expectation
is that the task will schedule immediately and TASK_NEW is produced via
ghost_prepare_task_switch().

In some cases, however, this assumption is broken, e.g. when an oncpu
task is moved to the ghost sched_class while it is in the kernel doing
sched_setscheduler(cfs).

Initial conditions: task p is running on cpu-x in the cfs sched_class.

    cpu-x                               cpu-y
T0                                      sched_setscheduler(p, ghost)
                                        task_rq_lock(p)
T1  sched_setscheduler(p, cfs)
    spinning on task_rq_lock(p)
    held by cpu-y.
T2                                      p->sched_class = ghost_sched_class

T3                                      p->ghost.new_task = true via
                                        switched_to_ghost(). MSG_TASK_NEW
                                        deferred until 'p' gets offcpu.

T4                                      set_tsk_need_resched(curr) via
                                        set_curr_task_ghost() to get 'p'
                                        offcpu.

T5                                      task_rq_unlock(p) before returning
                                        from sched_setscheduler().

T6  ... acquire task_rq_lock(p)
    p->sched_class = cfs_sched_class

T7  produce TASK_DEPARTED msg via
    switched_from_ghost() while the
    TASK_NEW msg is still deferred.

Tested: //third_party/ghost/api_test (cl/384364382)

Effort: sched/ghost
Change-Id: I15b44a04ddab399509ccfa67503ce4813bc5c5e2
Before "a9a7f79: check need_resched with rq->lock held before doing switchto."
it was possible for transaction commit to race with a task entering
context_switch due to switchto.

After a9a7f79 the race is no longer possible so we can simplify the
'latched_task' preemption logic in ghost_prepare_task_switch().

Tested:
- kokonut test //sched:ghost_smoketest
  Indus		http://sponge2/1632130f-df7c-4aa0-9141-31f810d9ad30
  		(one known failure tracked in b/192287338)
  Arcadia	http://sponge2/e4045c4a-4c0e-4b72-ada0-4a9d4fec15bf

- agent_muppet && switchto_test

Change-Id: I7a9879cf50e624de9092acea18c5ea4600adbc3e
Effort: sched/ghost
This WARN proved helpful while debugging an issue where a TASK_PREEMPT
was being produced for a task that was in an active switchto chain.

Note that we should be producing exactly two messages for a switchto
chain:
- TASK_SWITCHTO when the switchto chain begins.
- One of TASK_PREEMPT/BLOCKED/YIELD/DEPARTED when the chain is broken.

Producing the TASK_PREEMPT while the chain was still in progress
violated this contract and caused the agent to CHECK-fail.

Tested:
Ran all unit tests in virtme to verify that the warning is not seen.

Change-Id: Ib187a13dff95ce63cede2e065214d9e9c2d4e5dc
Effort: sched/ghost
This is easily reproduced by running
//third_party/ghost/api_test --gunit_filter=ApiTest.SchedAffinityRace

The panic was due to dereferencing 'task->ghost.status_word'
in task_barrier_inc() when delivering MSG_TASK_AFFINITY_CHANGED.

Initial condition is task 'p' is executing on CPU2 in ghost:
	CPU1				CPU2
T0					ghost task 'p' oncpu
					in do_exit() but hasn't
					lost its identity.
T1	sched_setaffinity(p) is able
	to find 'p' and take a ref
	on its task_struct.

T2					'p' schedules for the last time
					and task_dead_ghost(p) releases
					resources in 'p->ghost' like the
					status_word.

T3	set_cpus_allowed_ptr_common()
	takes task_rq_lock(p) and calls
	set_cpus_allowed_ghost() which
	panics the kernel when trying to
	deliver the AFFINITY_CHANGED msg.

Tested:
//third_party/ghost/api_test --gunit_filter=ApiTest.SchedAffinityRace
in a loop for 10000 times.

Change-Id: I04d3e731992f7552b9f7a71754547f881704f1ae
Effort: sched/ghost
Suggested by: brho@

Tested: run all unit tests in virtme and verify warning is not produced

Change-Id: I2a4ae2daaefb422d7c6d95f1d4c8cf0b8ee4a1fd
Effort: sched/ghost
Only a task that is running can accumulate cputime: specifically a task
that is runnable but not running could not have accumulated cputime
since the last time it got off the cpu.

While debugging cl/386301191 it became evident that dequeue_task_ghost
is routinely called for runnable-but-not-running tasks when migrating
a task via ghost_move_task() during txn commit. In this case the call
to update_curr_ghost() serves no purpose and is just overhead.

Fix this by calling update_curr_ghost() from dequeue_task_ghost() only
when the task being dequeued was running.
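
A sketch of the change; the exact signature and the rest of the dequeue
path are assumptions based on this commit message:

static void dequeue_task_ghost(struct rq *rq, struct task_struct *p, int flags)
{
    /* A runnable-but-not-running task (e.g. one migrated via
     * ghost_move_task() during txn commit) has accumulated no cputime
     * since it last got offcpu, so skip the update. */
    if (task_current(rq, p))
        update_curr_ghost(rq);

    /* ... rest of the dequeue path unchanged ... */
}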

Change-Id: Ic3421f250a8ffc4ccf16b54309c1589aaa9cacd7
Effort: sched/ghost
…ts a remote CPU

Currently, RTLA_ON_IDLE may only be used when an agent is yielding
itself. However, there are scenarios where a global agent is scheduling
remote CPUs and it wants the satellite agents on those remote CPUs to
wake up when the remote CPUs go idle. Thus, this KCL relaxes this
constraint on RTLA_ON_IDLE so that the global agent may use that flag
when committing transactions that target remote CPUs.

Tested: Coming soon

Effort: sched/ghost
Change-Id: I05f23be82faf189addfbad69cec14a33b6c8713c
This helps schedghostidle determine if the latcher was BPF or not.

Tested: schedghostidle with BPF
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ibd18726d057eeb9396afbd27688477855106d14d
If the agent is using BPF-PNT, knowing that a task got latched simplifies
the agent's state machine and accounting.  Instead of inferring success
by getting a preempt or block message, the agent can request a "latched"
message.

The main benefit to the agent is that it can easily determine the state
of a given Cpu.  e.g. TaskLatched handler sets cs->current.  Once
cs->current is set, that lets the rest of the scheduler interact with
that Cpu as if it was latched by a normal transaction.  For instance, it
can tell if a cpu is available or not.

Additionally, the agent knows when the task was latched, which helps
with its accounting, and we do not need to have yet another case in all
of the message handlers (e.g. "from_bpf", just like "from_switchto").

The agent could check with its BPF maps to determine the outcome of a
BPF-PNT, however this is outside the stream of messages for a task,
which makes the agent's job difficult.  (Specifically, if the agent is
looking at a BPF map for the outcome of a successful transaction, it
needs to be careful: it may already have handled a TaskDeparted for that
task.  The agent could track which BPF slot every task used, but that
does prescribe specific ways that the agent must use BPF, versus a "fire
and forget" model for running tasks.)

It helps the agent's cpu accounting for us to always send TASK_LATCHED
for a cpu after the previous task has left the cpu (e.g. preempt).
However, if the BPF-Latched task is preempted before it got on cpu, we
might send the message for TaskLatched before the previous task's
"got off cpu" message.  To help the agent handle this scenario, we tell
the agent this was a latched_preempt.

Also, that WARN_ON_ONCE we had can get hit if you have a latched task
when the agent exits.

To be submitted in conjunction with cl/387829782.

Tested: edf_test
Change-Id: I95892b9985e88081884c2e75ef996880f07064a5
Signed-off-by: Barret Rhoden <brho@google.com>
This field tells the agent that the task never actually got on cpu and
was preempted in the latched state.

Despite our attempts to maintain ordering of messages, it is possible to
send latched preemption messages before the previous task got off cpu.
I ran into this with BPF-PNT, but I have a hard time recreating the
scenario.  I think it was something like this:

01: task0 on_cpu in switchto

02: spurious PNT call, perhaps CFS briefly woke and migrated a task

03: since t0 was in switchto, we set must_resched, so we have to pick
some other task to run.

04: there is no latched task, so BPF runs

05: BPF latches a task: task1.  the RQ is unlocked briefly in here.

06: the agent wants to run task2.  it issues a txn while the RQ is still
unlocked.

07: task2 gets latched.  to do so, we must unlatch task1.  this sends
TASK_LATCHED and TASK_PREEMPT for task1.

08: commit complete, unlock the RQ

09: we're still in PNT, it grabs the RQ lock and continues

10: task2 is latched and selected to run

11: context switch from task0 to task2, producing TASK_PREEMPT (with
from_switchto) for task0.

The order of messages was:
- Latched task1
- Preempt task1
- Preempt task0

That's a little weird, since userspace thought that task0 was on cpu,
yet it receives Latched for t1.

We could attempt to send the preempt for t0 when we latch t1, however it
turns out we easily send latched-preempts out of order.  Consider an
agent issuing several transactions for the same cpu:

01: task0 on_cpu in a non-preemptible region
02: agent commits/latches task1.  sets need_resched.
03: before we run PNT, the agent sees the txn from t1 completed, and
issues another.
04: agent latches task2.  when doing so, it preempts task1.
(latched_task_preemption), send TASK_PREEMPT for t1.
05: PNT runs, ghost_produce_prev_msgs sets check_preempt_prev
06: PNT picks task2, since it was latched
07: context switch from t0 to t2, send TASK_PREEMPT for t0.

The order of messages was:
- Preempt task1
- Preempt task0

Even though task0 was on cpu.  The agent can handle this, since it knows
the transactions completed, and it can adjust cs->current and cs->next.
But from looking at the messages, you can't easily construct the cpu's
state.  The agent had to use external information: success of the
transactions.  If the agent requested SEND_TASK_LATCHED, the order of
messages would be:
- Latched task0 (from before step 01)
- Latched task1
- Preempt task1
- Preempt task0

Madness!

We can't send the messages for task0 when we latch task1, since we don't
know yet if task0 will block or yield or be preempted, so we'll have to
live with this.

To be submitted in conjunction with cl/387829782.

Tested: agent + switchto_test and simple_exp

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I9f2a62eff021eafdaf50763b36e5883a5cae397b
In a couple places, we've relied on the fact that we had a single
message channel in the global agent, forcing an ordering of messages for
a given cpu, such that we handle the messages in the order they were
sent.  That is no longer true, as we move to having multiple channels
with tasks sharded among the channels.  Messages for a task are always
sent in order to a particular channel, but tasks running on the same cpu
might not use the same channel.

To allow the agent to manage cpu state when receiving out-of-order
messages pertaining to a cpu, we now send the cpu_seqnum.

The cpu_seqnum is a per-cpu "history" counter included with all messages
that contain cpu state.  This state depends on the message.  For
TASK_LATCHED, it is that the task is latched on that cpu.  A TASK_PREEMPT
for task0 on cpuA means "cpuA has no ghost task running" at the instant
we sent the preempt.

That state is true when we send the message.  But right after the message
was sent, we could have a ghost task on cpu: perhaps by a completed
transaction, which does not involve a message.

The agent can use the cpu_seqnum to discard old cpu state information.
This requires that all cpu state be determined by the current message.
It'd be a pain to recreate state by handling *all* messages.
Additionally, we'll eventually need to handle losing messages, so we
don't want to rely on reconstructing state from out-of-order messages,
since some messages may never arrive.
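
In agent terms, the rule looks something like this sketch (struct and
field names are assumptions):

#include <stdbool.h>
#include <stdint.h>

struct cpu_state {
    uint64_t cpu_seqnum;    /* newest point seen in this cpu's history */
    /* cs->current, cs->in_switchto, etc. */
};

/* Returns true if the message's cpu payload may update cpu state. */
static bool msg_updates_cpu_state(struct cpu_state *cs, uint64_t msg_seqnum)
{
    if (msg_seqnum <= cs->cpu_seqnum)
        return false;   /* stale: use it for task state only */
    cs->cpu_seqnum = msg_seqnum;
    return true;
}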

To be submitted in conjunction with cl/387829782.

Tested: edf_test, agent + switchto_test.  No changes in userspace yet.
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I088cf6de2e7ade84f29cd53b05e1854dd01190a4
cpu_seqnum tells the user which moment in time we were at when we
committed the transaction.  Agents that use async transactions (i.e.
that don't spin until the kernel commits them) treat the completion of the
transaction as an implicit TASK_LATCHED message.  By adding cpu_seqnum
to the transaction, we correctly order this implicit message with the
real cpu messages.

It is safe to increment the cpu_seqnum independently of the message
stream: unlike our current task_barrier, the seqnum does not have a 1:1
relationship with a message.  If we wanted to, we could increment that
counter at arbitrary points.

When a global agent handles out-of-order messages and there are pending
transactions, it is difficult to determine what to set cs->current to.

The agent only updates cpu state when it gets a message with a newer
cpu_seqnum.  However, if there is a pending transaction (CpuState
cs->next != NULL) it must SyncCpuState at some point.  This function
checks to see if a transaction is complete, sets cs->current = cs->next,
and sets cs->next->state = ON_CPU.

The agent wants to sync before handling messages for cs->next; e.g. if
we get TaskBlocked for next, we want to set its state to ON_CPU first.
(Arguably, we could ignore this).

Since we may handle messages out of order, and only the most recent
message should mess with CpuState, we need to Sync when handling any
message, even messages for newer tasks on that cpu.

Consider this (messages and their cpu_seqnums are in parens):

01: task0 is already running
02: agent decides to preempt task0 with task1
03: task0 blocks  (B0)
04: cpu has no latched task, runs BPF
05: BPF latches task2
06: task2 runs (L1 - sent when t2 is on_cpu)
07: agent commits txn to run task1
08: task2 is preempted (P3, on ctx_switch, since it was on_cpu)
09: task1 blocks (B4)
10: BPF latches and runs task3 (L5)

There are other similar scenarios, such as if task2 is latch-preempted:
change steps 6,7,8 -> 7,6,8 (t2 doesn't run, but we still get a TaskLatched).

Due to out-of-order messages, the only constraint on messages is that
the agent receives L1 before P3, since they both belong to the same
task.  The agent will use the cpu_seqnum to know which is the most
recent version of history.  Additionally, when the agent checks
messages, it may be at any time after step 7.  Let's say all steps
completed.

Say we receive L1 first.  When we handle it, cs->next is set.  If
messages were in order, we could wait until we get B4 to sync state,
since B4 pertains to task1 and is the message that must happen after the
commit.  And we need to set cs->current at some point: if we don't, we
may have a task on_cpu but current is not set.

But since messages are not in order, if we don't Sync now (when we get
L1), we might not have another opportunity to Sync, because B4 could be
older in history.  In the case here, B4 is *newer*, so we will be able
to muck with cs->current.  But when we're at L1, we don't know that yet.
We could actually be handling L5 - there's no good way of knowing.  All
we know is that we're more recent than the previous cs->cpu_seqnum.

The root of the issue is that when we SyncCpuState, we are inferring a
TASK_LATCHED message: we know a commit succeeded, so we set cs->current.
If we asked for a TASK_LATCHED with the commit (which is what BPF does),
then we would be able to just set cs->current in the TaskLatched()
handler, and *not* do it in SyncCpuState (i.e. have Sync only reap the
commit, but not muck with *cpu* state).  But since we don't want to get
TASK_LATCHED messages all the time, we infer that the latching happened.

The solution is to treat the completion of the transaction as a change
in the cpu's history: increment and report the cpu_seqnum.  That allows
userspace to both reap the transaction as well as (optionally) change
its CpuState: if the transaction represents a newer version of history.
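
A sketch of the resulting SyncCpuState(); the names come from this
commit message, while txn_complete() and the struct layouts are
assumptions:

#include <stdbool.h>
#include <stdint.h>

enum { ON_CPU = 1 };
struct Task { int state; uint64_t txn_cpu_seqnum; };  /* illustrative */
struct CpuState { uint64_t cpu_seqnum; struct Task *current, *next; };

static bool txn_complete(const struct Task *t);  /* assumed txn check */

static void SyncCpuState(struct CpuState *cs)
{
    if (!cs->next || !txn_complete(cs->next))
        return;
    if (cs->next->txn_cpu_seqnum > cs->cpu_seqnum) {
        /* The commit is a newer point in the cpu's history: treat it
         * as the implicit TASK_LATCHED. */
        cs->cpu_seqnum = cs->next->txn_cpu_seqnum;
        cs->next->state = ON_CPU;
        cs->current = cs->next;
    }
    cs->next = NULL;    /* reap the transaction either way */
}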

To be submitted in conjunction with cl/387829782.

Tested: edf_test

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I0d7988ba7a272e96841974d7886e13e975888886
TASK_DEPARTED has a cpu parameter, but if the task was not actually
running, then the agent can't infer anything about the cpu.  In
particular, it does not know if a switchto chain ended (even if
from_switchto is not set) or if there is another ghost task current.

In the new world of out-of-order messages, the main thing about
from_switchto is for the *task*, not for the cpu.  It's the kernel's way
of saying "you probably weren't expecting this task to run, since you
thought it was blocked, but it woke up, ran, and then
blocked/preempted/departed".  Since from_switchto is pertinent only for
the task, we can't use cpu state to help interpret it.  (Also,
was_current will always be true for every message other than departed,
where it is occasionally false).

But from_switchto no longer conveys *useful* information about the cpu
itself.  It does tell us at that point in cpu_seqnum history that we
left switchto.  But when we handle a message with from_switchto=true in
the agent, we might already have handled a *newer* message (one with
from_switchto=false) and we left ST in response to that newer message
(an "implicit leave-ST", discussed below).  Alternatively, we might not
have received the original TaskSwitchto message yet.

Keep in mind this rule: the agent can only use the most recent
cpu_seqnum to adjust cs->current or cs->in_switchto.

So if we receive a message that is newer than any others, then we can
adjust in_switchto.  For almost every message, we'll leave the ST chain:
basically any message that means "X is or was running on this cpu"
(latched, blocked, preempted, or departed with was_current=true).  I
think of these as implicitly leaving ST.

The important part is that whether or not we leave ST is independent of
whether the payload's from_switchto was set.  We leave ST as soon as we
get a message that implied the ST chain ended.  Later on, we might
receive an older message with from_switchto set: but we already set
cs->in_switchto=false.  We might also receive an older message
TaskSwitchto.  But it's old, so we ignore it: we already left.

Here's another scenario that shows why we can't use from_switchto to
adjust cs->in_switchto:

  ... ST1(task0), B2(task1,from_st=true), L3(task2), ST4(task2), ...

those are the messages in the order they were sent: we enter ST, then
block with from_st=true.  Then BPF latches a new task, then that task
STs.

We could receive those messages out of order:  (L3 and ST4 are from the
same task, so they are in order).

  possible order of handling:

  ST1, L3, ST4, B2(from_switchto=true).

When we handle B2, we can't touch cs->in_switchto, since 2 < 4.
Otherwise, we'll falsely think we're no longer in a ST chain.  The ST
chain ended with L3.  All the agent knew was that we were in an ST
chain, then suddenly we latched something.  The agent knows it'll
eventually get some message with from_switchto set, but due to the
above scenario, it can't wait until it gets B2 to adjust in_switchto,
since we might be back in a ST chain.

Also: I was slightly worried that the was_current check wouldn't handle
the case where the departed task was latched, but was_current=false.
However, in that case, the kernel will send a preemption before the
departed (was_current=false) message.  So it's OK.

Also also: Neel pointed out that the agent will be able to check
task->on_cpu() to determine if it was current or not.  This flag makes
it a little easier on the agent, but is essentially a double-check.
Since the task's messages are delivered in order, the agent would know a
task latched (either via MSG_TASK_LATCHED or by completing a TXN) before
it handled MSG_TASK_DEPARTED.

To be submitted in conjunction with cl/387829782.

Tested: agent + switchto_test and simple_exp
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I398a6f2e4b315d78864d8a85e74f7c57f780f415
Change-Id: I67d6f61b521238f774c710d1232b5978b12d7eb3
Prior to this change 'txn->commit_flags' was used to communicate
singular, mutually exclusive values:
- greedy commit		(commit_flags = 0)
- inline commit		(commit_flags = COMMIT_AT_TXN_COMMIT)
- PNT commit		(commit_flags = COMMIT_AT_SCHEDULE)

This made it impossible to use 'commit_flags' to communicate any
other information to the kernel. For example ALLOW_TASK_ONCPU is
communicated via 'run_flags' even though 'commit_flags' would be
a better choice (ALLOW_TASK_ONCPU is only relevant at commit time
as opposed to when the task is running).

Fix this by interpreting 'commit_flags' as a set of individual flag
bits rather than a singular enumeration value.

Note that there is no change in ABI.
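
A sketch of the reinterpretation; the bit values below are assumptions
(presumably the existing values were already distinct bits, which is
what keeps the ABI unchanged):

/* Old: commit_flags held exactly one of these.  New: a bitmask. */
#define COMMIT_AT_TXN_COMMIT    (1 << 0)    /* inline commit */
#define COMMIT_AT_SCHEDULE      (1 << 1)    /* PNT commit */
/* greedy commit remains commit_flags == 0 */

/* Freed-up bits can carry commit-time options, e.g. (next commit): */
#define ALLOW_TASK_ONCPU        (1 << 2)    /* moved from run_flags */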

Tested: all unit tests pass in virtme

Effort: sched/ghost
Change-Id: I0fa869fce9ca30870c29f6637e92fd9edcf28fae
The effect of ALLOW_TASK_ONCPU is limited to when a transaction
is committed, so it belongs in txn->commit_flags.

Submitted in conjunction with cl/388747781

Tested: verified all unit tests in virtme.

Effort: sched/ghost
Change-Id: I5a8fb2e9c964d61527252f355b010730f419aaf4
Advertising a departing task as runnable invites a race where the agent
can try to schedule the task before it handles the TASK_DEPARTED msg.

In this case the commit fails with a GHOST_TXN_INVALID_TARGET error
which is a fatal error in the agent (b/195081642 has more details).

Forcing 'runnable=false' in this situation leads to a couple of
discrepancies:

- 'task_new->runnable' is not coherent with the GHOST_SW_TASK_RUNNABLE
  flag in the task's status_word (but ultimately fine since this is
  indistinguishable from a blocked task that woke up while the agent
  was handling the TASK_NEW).

- 'task_new->runnable' is not coherent with 'departed->was_current'
  but this should be okay since 'was_runnable' deals with task state
  whereas 'was_current' deals with cpu state.

Tested:
- 10000 iterations of api_test --gunit_filter=ApiTest.DepartedRace
- kokonut test //sched:ghost_smoketest
  http://sponge2/a5f65f0e-68e4-4152-94c8-e1d6628bc42c

Effort: sched/ghost
Google-Bug-Id: 195081642
Change-Id: I89ebea15961bd70c2899f2c324d3e15be6fc49d3
Replace the ghost_switchto_disable sysctl with a "switchto_disabled"
enclave tunable. This tunable has the same meaning as the sysctl and is
accessible via ghostfs (similar to runnable_timeout).

The associated test is modified in cl/387863829. The presubmit failure
will go away once gtests is updated to that CL.

Tested: //prodkernel/tests/switchto_tunable_test (cl/387863829) in virtme

Effort: sched/ghost
Google-Bug-Id: 195752832
Change-Id: I71c672d0bd07d0e06694b623610d2d05b8b02292
Barret Rhoden and others added 30 commits March 17, 2022 20:22
There's no way to prevent ghost client tasks from grabbing the
kernfs_mutex.  It's used in sysfs and other places.

If an agent task, responsible for scheduling tasks, ever attempts to
grab the kernfs_mutex, there's a chance we'll deadlock.  Specifically,
the client can hold the mutex and be descheduled.  If the agent blocks
on acquiring the mutex, it will never schedule the client.

In general, ghostfs operations on open FDs don't touch the kernfs_mutex
- there might be corner cases I didn't see.  But we definitely grab it
when opening paths (walking or lookups).  The agent doesn't currently
open any files from ghostfs from agent tasks, so this warning hasn't
fired yet.  But at least we'll catch it if it does and avoid a potential
source of deadlock.

Tested: edf_test, enclave_test, api_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I38422580ccc6b098fd2ee556d83e09f919a467cb
This adds an extra field, 'type', to the ghost_timer_fd structure. This
allows the agent to distinguish between different types of timers that it
may queue. The cookie field in the timer_fd then acts as an argument to
be interpreted based on the type of timer.
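
A sketch of the extended structure; only 'type' and 'cookie' are named
here, so the field widths and any other members are assumptions:

#include <linux/types.h>

struct ghost_timer_fd {
    __u64 type;     /* agent-defined timer class (the new field) */
    __u64 cookie;   /* argument, interpreted according to 'type' */
    /* other fields assumed unchanged */
};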

Paired with cl/414055464.

Tested: In combo with the userspace changes.

Effort: sched/ghost
Change-Id: I80bbded859b338123b0873fdb91bf554034113f4
In preparation for merging into sched/next, get rid of some
gratuitous diffs.

Tested: all unit tests pass in virtme.

Effort: sched/ghost
Change-Id: I87885570714ee0709984ce657c96a0a1655e20e3
gbuild -s ARCH=arm64

Tested: amd64 builds

Effort: sched/ghost
Change-Id: I66b1f122a89fddb41c1065f7e7b16b3357b8028d
Adding the 112-byte struct in the middle of 'struct rq' perturbed the
cache locality of the following members, which showed up as increased
cputime in the SwitchFutex benchmark (especially when the threads are
not in the same LLC domain on AMD cpus).

Effort: sched/ghost
Change-Id: If17714649c9d351e85aa0e759f57b64d29b6adc4
Currently, whether or not prev is allowed to be selected by PNT is
controlled by rq->ghost.must_resched, which is protected by the RQ lock.

Setting must_resched means that the next time PNT runs, prev will be
preempted by something - even if it is the idle task.  (When prev is not
an agent).

An agent can request that prev be unscheduled by issuing a NULL
transaction.  This grabs the RQ lock.  We will need to preempt tasks
from BPF-MSG, which already holds an RQ lock, so we need to set
must_resched, or something similar, locklessly.

The lockless nature of this is tricky - if we're in PNT, already ran
BPF-PNT, and then the agent tries to resched the cpu, do we want the
resched to take effect or not?  Was the agent referring to prev (which
is different from the latched task), or to the task it just latched?

This sounds a lot like why we have barriers.  The agent wants to change
state, based on its understanding of the system.  Only apply that change
if the agent's understanding matches reality.

We already have the cpu_seqnum: the history/state of a given cpu,
exposed in all relevant messages.  Even better, unlike the barriers,
userspace has no assumptions about being told about every increment to
cpu_seqnum.  We can increment it whenever any cpu_state changes and we
don't have to send a message.

So when the agent (via bpf, in an upcoming commit), asks us to resched,
they can tell us the cpu_seqnum they got from their last TASK_LATCHED.
As a side note, this means that they should *not* use the return value
of bpf_ghost_run_gtid() to mean "it's ok to resched right now!", since
they don't know the cpu_seqnum.  Even if we returned that value, as soon
as we send TASK_LATCHED, we increment again (it's another change in cpu
state, perhaps better renamed as TASK_ON_CPU, but that's a separate
issue).  The agent should only resched based on the latched->cpu_seqnum
from when prev got on cpu.

Tested: edf_test

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ie9edaab22ae2f77078ca2b887fe72d4d528b2e80
This is a bpf helper, callable from any enclave BPF program (i.e.
BPF-PNT, BPF-MSG, or skip-tick).  It will force the cpu to reschedule,
such that "prev" will not be picked again.

Agents can use this to preempt a task.  When the cpu reschedules,
assuming there is no latched task, BPF-PNT will run so that the agent
can pick another task to latch.

A few notes:
- You can call this from within BPF-PNT.  If you call it on your own
cpu, it will have no effect, mostly, similar to setting need_resched
during pick_next_task().  The technical reason it won't work is that the
kernel calls clear_preempt_need_resched() right after pick_next_task(),
so any "need_resched" set during PNT will be a noop.  "Mostly", since
you could write "cpu_seqnum + N" and have must_resched set at some
arbitrary point in the future.  (The kernel cleans this up during
cpu/agent teardown).

- check_same_enclave() works so long as you are in an rcu read critical
section.  We're a bpf helper, and our bpf programs are run under an
rcu_read_lock().  Someone could remove the cpu from the enclave, but
that cpu cannot be handed out to another enclave until after a grace
period.  This is the case for all of the ghost bpf helpers.

- I think you don't need an smp_wmb() between the WRITE_ONCE and the
resched_cpu_unlocked() (which is a prodkernel function).  That has a
CAS in it, and in general, I'd expect sane behavior between the
ordering of writes and IPIs in Linux.  Though that might not be sane.
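
A sketch of using the helper from BPF-MSG; the helper name, ID,
signature, and section name are assumptions:

#include <linux/types.h>

static long (*bpf_ghost_resched_cpu)(__u32 cpu, __u64 cpu_seqnum) =
    (void *)3002;

__attribute__((section("ghost_msg/msg_send"), used))
int maybe_preempt(void *ctx)
{
    __u32 cpu = 0;      /* cpu where the victim task got on cpu */
    __u64 seqnum = 0;   /* cpu_seqnum from its TASK_LATCHED */

    /* No-op if the cpu's history has moved past 'seqnum'. */
    bpf_ghost_resched_cpu(cpu, seqnum);
    return 1;           /* deliver the message as usual */
}

char _license[] __attribute__((section("license"), used)) = "GPL";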

To be submitted with cl/422883910.

Tested: edf_test, using the helper in bpf-msg.
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I2746e6c789010c3a1b0014483d3d2b788723149f
Certain operations, particularly ghost bpf helpers, are operating on
some enclave, but it might not be the enclave that owns the cpu.

Consider an enclave on a set of cpus, and a task on another cpu enters
the enclave.  That generates a TASK_NEW, which triggers BPF-MSG.  That
all runs on a cpu that is not in the enclave.

Add per-task *__target_enclave to track which enclave we are targeting
with our functions.

Users of the target_enclave can nest, such that if we are in the middle
of some operation and get interrupted, the IRQ handler can set its own
target_enclave so long as it restores it.

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I1cf7bc70bf2bda6d133155b9a8877ca531074587
Directories before files, then alphabetical sorting.  No capital
letters.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ib9267e091342ea8a84c4e0e639fe3fae5a673fba
put_user() may fail at the end of ghost_associate_queue() while dst_q
was set successfully. Return the status directly instead of writing it
to the ioctl parameter.

To be submitted with cl/427797430.

Effort: sched/ghost
Google-Bug-Id: 202070945
Change-Id: I9a72bde02afa556bf5c644be4bb80c98274c1b3a
Prior to this change a task with a poisoned rendezvous could
produce a MSG_TASK_PREEMPT (e.g. if the next task picked to
run on the cpu belongs to CFS). This is not intuitive since
the overall sync_group has failed. It is also easy to miss
since producing MSG_TASK_PREEMPT is the exception; see, e.g.,
cl/427576696, which fixes the flaky ApiTest.LocalAgentWakeup.

Effort: sched/ghost
Google-Bug-Id: 214648944
Change-Id: Ib13c38f60b8d1acbd292d4de7b8d7ff528f44c4f
At the moment ghOSt supports more than one enclave running
concurrently on the host as long as they all have the same
ABI.
Relax the check in _ghost_resolve_enclave() to reflect this, since it
is possible for the task_reaper (which is a CFS task) to run on any
cpu, including one that belongs to an enclave different from the one
it is reaping.

Effort: sched/ghost
Google-Bug-Id: 218869667
Change-Id: I1b79d4468c343a5a837349ba6984f79bf1f1ec40
Instead of using syscalls, use ghostfs to move tasks to an enclave.
This cleans up core.c's setsched code.  We had a weird dance of passing
along a ctl_fd, then converting that to an enclave, but we had to put
the enclave after unlocking the RQ.  It wasn't pretty.

Instead, if we enter the kernel through ghostfs, the enclave is already
known, and its refcounts are managed for us.  This is analogous to using
ghostfs ioctls instead of raw syscalls: we do not need any magic to
figure out which enclave we're using.

To that end, there are two mechanisms to enter ghost, one for regular
tasks and another for agents.

To add a regular task to an enclave, write its pid into enclave_x/tasks.
Write 0 for 'current'.  The tasks file is world writable, so anyone can
join an enclave, if they can find it.  However, to move another task
(pid != 0), you need CAP_SYS_NICE, just like with
sys_sched_setscheduler().

You can read the tasks file to get a list of tasks in the enclave,
though beware that if you have too many tasks, the file might be
truncated (seq file overflow).
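
For example, a task can join an enclave with a plain open/write; the
path below is illustrative:

#include <fcntl.h>
#include <unistd.h>

static int join_enclave(const char *tasks_path)
{
    /* e.g. "/sys/fs/ghost/enclave_1/tasks" */
    int fd = open(tasks_path, O_WRONLY);

    if (fd < 0)
        return -1;
    if (write(fd, "0", 1) != 1) {   /* "0" means current; a real pid
                                     * requires CAP_SYS_NICE */
        close(fd);
        return -1;
    }
    return close(fd);
}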

To add an agent to an enclave, write "become agent <QFD>" into ctl.  You
need write access to ctl, which an agent will have.  The QFD is the FD
of the queue you want for this agent, -1 for "default".  qfd is the
same parameter we used to pass in the sched_attr.

To be submitted with cl/424409182.

Tested: edf_test, enclave_test, api_test, transaction_test,
manually moving tasks into an enclave from the shell.

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ib4ead6536354259ca1a464e62d72220c0465eef4
Now that there are no other users of ctlfd_to_enclave, we can clean up
how bpf programs are linked.

- We know the ABI from the expected_attach_type, so we don't need to do
the for_each_abi() check.

- We only need the struct fd and the kernfs kn until we incref the
enclave.  Simply returning a kreffed pointer simplifies the code.

- ghost_bpf_link_attach just needs to figure out the ABI.  The rest of
the guts can be in ghost.c.  This means abi->bpf_attach and detach can
be in ghost.c directly.  Further, the ctlfd_to_enclave can be in ghost.c
too.  Future ABIs don't even need to use the ctlfd (though the old
ghost_core.c didn't actually know what an enclave ctl is).

- We no longer do any PROG_TYPE checking in ghost_core.c.  That makes it
easier to change our types and attach points in the future.  e.g.
dropping skip_tick.  We still have PROG_TYPE_GHOST in other places in
the kernel, so it's not doable yet.

Tested: api_test, edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Iea57c6061047dcb0bf0c4bb2e031a86bbc62b689
enum bpf_func_id is created by a macro that is used for an array lookup
from helper ID to string (func_id_str in disasm.c).

We don't want that array to be too big.  However, we also don't want our
helper ID numbers to clash with upstream's function IDs.  That will
happen eventually.

Move the ghost bpf helper IDs out of bpf_func_id so that we can use a
larger number without affecting the array.

The array mapping is optional - it's used in func_id-to-name lookups,
and those lookups are protected in case the func_id > __BPF_FUNC_MAX_ID.

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I5b221d1834f1afe11aed55f73216b9e3d8915f46
With the growing number of BPF helpers, we'll soon run out of space in
the bpf_func_id space.

Increase the base numbers for PROG_TYPEs, the ATTACH_TYPEs, and helpers:
- progs:   1000
- attach:  2000
- helpers: 3000

There is enough room for growth in upstream's numbers that we'll either
be merged upstream or dead before we run into conflicts.  We're still
well under the 0xffff requirement for attach types (since the upper bits
are the enclave ABI).
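
As a sketch (only the bases come from this commit; the enumerator names
other than BPF_PROG_TYPE_GHOST_SCHED are illustrative):

enum ghost_prog_type {
    BPF_PROG_TYPE_GHOST_SCHED = 1000,   /* progs start at 1000 */
};

enum ghost_attach_type {
    BPF_GHOST_SCHED_PNT = 2000,         /* attach types start at 2000;
                                         * upper bits hold the enclave
                                         * ABI, still under 0xffff */
};

enum ghost_func_id {
    BPF_FUNC_ghost_wake_agent = 3000,   /* helpers start at 3000 */
};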

To be submitted with cl/424689285.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ie0600e0b4478d3289377e4a4d58c9e13b4a491aa
Move the ghost attach types to their own enum.

Change-Id: I91ad42032a21ead9dd99741040f26fbfcc4004a2
This lets us do abi-specific parsing of the create command.
Specifically, we can add abi-specific commands after the ABI part of
"create $ID $ABI".

Tested: edf_test

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: If9a8d678ec8a3072c83768672c57523a04cb3d2e
Origin-9xx-SHA1: b549601a7b45001f48237968d9ba3c8b1cfc56a6
Rebase-Tested-11xx: compiled
Add an optional parameter to "create" for a GID of the new enclave.
Default is root.  Use the caller's EUID as the UID for the enclave.
This parameter to create is ABI-dependent, since it is after version
parsing, meaning we can change it easily.

In an effort to run the agent as someone other than root, we'd like to
chown the files in the enclave_dir to whoever will run there.
Prior to this commit, all ghostfs files are owned by root:root.

Additionally, we'd like to restrict the agent from changing the cpus.
We can chmod g-w, so long as the agent is not the owner.

Putting these together, the agent can use the group access bits, and
whoever creates the enclave (some trusted daemon) can be the owner.  The
reason to make the creator the owner is so that they can chmod cpulist
and cpumask.

Example: enclave ID = 565, ABI = 65, desired GID = 716

(root):/sys/fs/ghost# echo create 565 65 716 > ctl
(root):/sys/fs/ghost# echo 0-10 > enclave_565/cpulist
(root):/sys/fs/ghost# chmod g-w enclave_565/cpulist
(root):/sys/fs/ghost# chmod g-w enclave_565/cpumask
(root):/sys/fs/ghost# ls -la enclave_565/
total 0
dr-xr-xr-x 3 root spaceghostd      0 Feb  3 13:02 .
dr-xr-xr-x 3 root root             0 Feb  3 12:49 ..
-r--r--r-- 1 root spaceghostd      0 Feb  3 13:02 abi_version
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 agent_online
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 commit_at_tick
-rw-rw---- 1 root spaceghostd 458752 Feb  3 13:02 cpu_data
-rw-r--r-- 1 root spaceghostd      0 Feb  3 13:09 cpulist
-rw-r--r-- 1 root spaceghostd      0 Feb  3 13:02 cpumask
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 ctl
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 runnable_timeout
-r--r--r-- 1 root spaceghostd      0 Feb  3 13:02 status
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 switchto_disabled
dr-xr-xr-x 2 root spaceghostd      0 Feb  3 13:02 sw_regions
-rw-rw-rw- 1 root spaceghostd      0 Feb  3 13:02 tasks
-rw-rw-r-- 1 root spaceghostd      0 Feb  3 13:02 wake_on_waker_cpu

(spaceghostd):/sys/fs/ghost/enclave_565$ chmod g+w cpulist
chmod: changing permissions of ‘cpulist’: Operation not permitted

(spaceghostd):/sys/fs/ghost/enclave_565$ cat cpulist
0-10

(spaceghostd):/tmp/$ ./agent_muppet --enclave /sys/fs/ghost/enclave_565/

That actually fails due to missing CAP_BPF, but the ghostfs stuff
worked.

Note that to create an enclave, you need write access to ghostfs/ctl,
which is owned root:root.

Tested: creating enclave and attached to it, edf_test, agent_muppet
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I93a6cca6c29e7a318460ef599c0b269d516ec4f2
Origin-9xx-SHA1: 799b9eb05be77331ec9b64eb9abe8dcc2b40182c
Rebase-Tested-11xx: compiled
It should come as no surprise that calling the old ghost_destroy_enclave()
directly can cause issues.  We already had infrastructure to destroy the
enclave with a work_struct, which helped when destroying the enclave
from tricky sched code (IRQ context).

It turns out that we also need the work_struct to handle destroying the
enclave from ghostfs!  Why?  Consider the case where the task destroying
the enclave is in the enclave, e.g. bash joined the enclave.  We'll hang
after the synchronize_rcu() call.  Essentially, we're still in the
enclave, the agents were killed, and we need someone to schedule us.  We
haven't reached the part of the code that kicks all the enclave tasks
back to CFS.

The cleanest thing is to just defer the work to CFS, which we already do
when we destroy the enclave from IRQ context.  Now, anyone can call
ghost_destroy_enclave() without worrying about their context or other
race conditions.  Even an agent task should be able to do it.

Tested: edf_test, moved bash into an enclave and destroyed via ghostfs

Effort: sched/ghost
Google-Bug-Id: 216664048
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ief60ecfec4df05bab7d157f60e04d5e5f42a25bd
Origin-9xx-SHA1: 28e5f4621e10640f34f6ac8b885239c4f51d0c51
Rebase-Tested-11xx: compiled
Skip-tick was our first BPF use case.  Anything you could do with
skip-tick can be done with BPF-MSG.  However, not all agents want to use
BPF, particularly for something as trivial as suppressing all cpu tick
messages.

Add an enclave tunable, "deliver_ticks", which defaults to 0.  When set,
we'll deliver cpu tick messages during the ghost timer tick.  At this
point, BPF-MSG can intercept the message, handle it, and optionally
suppress it - just like any other message.

To be submitted with cl/427258783.

Tested: edf_test, agent_exp with/without --ticks,
agent_muppet + simple_exp, api_test

Effort: sched/ghost
Google-Bug-Id: 210860596
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I8beae844f241b94d8e80dbdbb12660e536bda607
Origin-9xx-SHA1: 4ee077db4d91c8cc3307188e953e0b336cf5eebf
Rebase-Tested-11xx: compiled
GHOST_BPF and bpf_ghost_sched were in the main uapi bpf.h, but not the
tools/ version of the header.

It's not as simple as copying the entire bpf.h header from the kernel to
tools/, since there are a bunch of other differences in non-ghost code,
mostly in comments.

Tested: //net/bpf:bpf_test_suite:test_progs
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ica002dabf68fba633fc511b6a4130a809256f552
Origin-9xx-SHA1: 9321a72de9b96bd566cacc5867e2799507aa01f4
Rebase-Tested-11xx: compiled
…grams

Once we've inserted the bpf programs (attached or linked), we no longer
need to load or attach/link more programs.

In the off chance our agent is compromised, disabling bpf_prog_load
prevents it from loading new hostile programs that it could attach to
other places in the kernel.  Our agents typically run with
CAP_BPF, though not necessarily CAP_PERFMON, so the list of bpf programs
we could load and attach is limited.  This helper limits it further.

The existing programs we already loaded are considered "known good".
They cannot be attached to random places in the kernel: only to BPF-PNT
and BPF-MSG.

This helper is a one-way operation for the calling process.  A new agent
that attaches to the enclave will be able to load and attach new
programs.  Note you cannot have more than one program attached per
enclave, per attach point.

To be submitted with cl/429130082.

Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I4895e78b0c7e49006f16dddb7341f109a3bd006b
Origin-9xx-SHA1: 49667d54efeb559e9dd375f68ff88da0524e0f3b
Rebase-Tested-11xx: compiled
No callers were checking the return value.

Tested: edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I2bbb912c61b11358d2cf8dd3593c689f647415c2
Origin-9xx-SHA1: 8bb13a445545f9cdd0dac7eab761311e781bd056
Rebase-Tested-11xx: compiled
Previously, BPF-PNT ran only as a last resort, after several things:

1) if there is no runnable agent (not blocked_in_run)
2) if we will not yield to any other class between PNT-A and PNT-G,
   specifically to CFS (rq_adj_nr_running)
3) if there is no latched task
4) if the current task won't keep running (must_resched)
5) if there's no pending transaction (unlatched, but sitting in the txn
   region, perhaps with COMMIT_AT_SCHEDULE).

After all of that, we'll run BPF, then since we unlocked, go back and
recheck things like if the agent woke, if CFS woke, if bpf latched a
task, etc.

At the moment we unlocked the RQ to run BPF-PNT, cases 1, 2, and 3
could all have become false, which we were able to handle.  However, that's
a rare race.  By exposing the RQ status via the context struct
bpf_ghost_sched, BPF-PNT can make the same decisions it did before: e.g.
if ctx->agent_runnable, then don't bother latching a task.

The BPF program can detect cases 3 and 4 from ctx->next_gtid, but there
is no nice way to detect case 5.  The only real user was
COMMIT_AT_SCHEDULE, which was added to speed up slow global agents and
is largely unused.  BPF-PNT solves the issues of COMMIT_AT_SCHEDULE, so
I recommend not using both BPF-PNT and COMMIT_AT_SCHEDULE.
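
A sketch of BPF-PNT making those decisions itself; agent_runnable and
next_gtid come from this commit message, while the section name and the
rest of the context layout are assumptions:

#include <linux/types.h>

struct bpf_ghost_sched {
    __u8  agent_runnable;   /* case 1: a runnable agent exists */
    __s64 next_gtid;        /* cases 3/4: latched or must-run task */
    /* layout assumed */
};

__attribute__((section("ghost_sched/pnt"), used))
int pnt(struct bpf_ghost_sched *ctx)
{
    if (ctx->agent_runnable)
        return 0;   /* don't bother latching; the agent will run */
    if (ctx->next_gtid)
        return 0;   /* something is latched or will keep running */
    /* ... otherwise pick and latch a task ... */
    return 0;
}

char _license[] __attribute__((section("license"), used)) = "GPL";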

Running BPF later and redoing the checks was a minor inconvenience, but
we were essentially deciding for BPF-PNT what it wanted to do,
presuming that latching a task was the only thing it could do.

Submitted with cl/435677777.

Tested: api_test, edf_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I4ba012add422ee34c4d2a103b1e19e9bd95cd765
Origin-9xx-SHA1: dc74a30f80d9b03483ef95f5797ae9fd7d3609d2
Rebase-Tested-11xx: compiled
Fixed in commit 875ee64be8cd ("sched/ghost: fix bpf_helper enclave
discovery").

Tested: compiled
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I01ca180a51ec04e4adfcc135ebd1cf2e63fafbc3
Origin-9xx-SHA1: 79d2ede01b2e3c178f72daf32426ea29fe385d0a
Rebase-Tested-11xx: compiled
select_task_rq() did not always call select_task_rq_ghost(), and even if
it did, it could ignore the answer - particularly if the cpu was not in
the task's allowed_cpus.

The RQ doesn't matter much for ghost.  The "real" RQ is in the agent,
but the kernel still uses the (per-cpu) RQ for its own bookkeeping,
including the steps needed to wake a task.  For an in-kernel scheduler
like CFS, the RQ matters: it's where the task will compete for cpu time
and may ultimately run.  For ghost, it's a temporary staging ground
until a task gets latched (at which point it will change RQs).

Of note is the TTWU_QUEUE feature.  Normally, when we do a 'remote'
wakeup, meaning select_task_rq() picks a cpu other than the caller's,
we'd grab their rq lock, do the wakeup, and maybe IPI if we need to
preempt.  With TTWU_QUEUE, we can put the task on the RQ's wake_list,
then send an IPI (amortized).  This trades an IPI for lock bouncing -
presumably if the task will actually run on its old cpu, then this will
help.

The problem arises when we pick a rq that is neither the task_cpu (i.e.
previous cpu it ran on) nor the current cpu.  This third-party cpu can
get overloaded with resched IPIs from TTWU_QUEUE remote wakeups.

CFS tries to find an idle RQ, and doesn't aggressively load balance or
otherwise migrate tasks from a cpu.  Ghost does neither.  We offer a
choice between task_cpu and the current cpu (waker's cpu).  And we can
yank tasks off remote RQs without hesitation: this is what happens
during a latch.

select_task_rq_ghost() would normally be OK, but prior to this commit,
select_task_rq() would ignore the result under certain circumstances,
specifically if the cpu was not in p->allowed_cpus.  The ghost kernel
ignores affinity: that's up to the agent to enforce, yet
select_task_rq() was enforcing affinity, but only for the 'temporary'
rq - not the actual execution of a task.

Consider a task that was affined to numa node 1, yet the agent ran it on
node 0.  It wakes on node 0, and select_task_rq_ghost() returns a cpu
from node 0.  select_fallback_rq() will pick the first cpu on node 1 to
do the wakeup, and using TTWU_QUEUE, it will send an IPI and that cpu
will have to do all the wakeup bits.  For ghost, that includes sending
the wakeup message, which can run a bpf-msg program.

Once the task is woken up, another cpu (perhaps from node 0) can latch
it - essentially pulling it away from the victim cpu.  If that task
blocks and wakes quickly, and it was running on node 0, the cycle will
continue.

If you have enough cpus running/blocking/waking tasks, and shipping
their wakeup work to a victim cpu, then that cpu will get stuck in IRQ
context, constantly handling new resched IPIs.  You'd need enough cpus
to keep the victim busy.

The aggravating factor here is bpf-msg.  bpf programs will finish in
a bounded amount of time, but there's nothing that stops them from being
inefficient.  The world's simplest scheduler (my Biff scheduler in a
forthcoming CL) is a global queue, implemented with a
BPF_MAP_TYPE_QUEUE, which is protected by a kernel spinlock, which is an
MCS queue lock.  If you have all the cpus hammering that lock too, it
will make bpf-msg less efficient, to the tune of 40-50us per task
ttwu().

Even without bpf-msg, you'd need only enough cpus to keep the victim
busy.  If the wakeup cycle is W_c, and the time to block, wake, run,
and reblock is T_c (task), you'd need T_c / W_c cpus to keep the victim
stuck in its IPI handlers.  But a combination of bpf-msg and a really
dumb scheduler increased W_c.

The trick is to break the cycle: by requiring that select_task_rq()
returns either task_cpu() or the current cpu, we avoid this "stuck in
resched IPI" scenario:
- if it's the current cpu, we don't need an IPI.
- if it's task_cpu, then when some other cpu pulls it from the victim's
  RQ, that cpu becomes the new task_cpu, breaking the cycle.

Tested: SLL
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: Ib65a1672420c72ff9d3bf71c0d50eba245a1019d
Origin-9xx-SHA1: 37588684a0ce59fb763a4deb2fea31078e47984c
Rebase-Tested-11xx: compiled
The agent's bpf-msg program returns whether or not they want us to send
the message to their Channel.  If they do not want the message, then we
can free the status word directly in the kernel.

Normally agents free the status word after a dead or departed message.
If there is no message, userspace will not free the status word.
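
A sketch of a program that consumes a message in-kernel; the context
type, section name, and return convention (nonzero means "send") are
assumptions:

__attribute__((section("ghost_msg/msg_send"), used))
int drop_all(void *ctx)
{
    /* Returning 0 suppresses delivery to the Channel; for a dead or
     * departed task the kernel then frees the status word itself. */
    return 0;
}

char _license[] __attribute__((section("license"), used)) = "GPL";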

Tested: edf_test, api_test
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I01f02e92581fe27405af1cae4345bb78b9aafc7b
Origin-9xx-SHA1: d673ad0e788d84be97a83b984367b84c64c83011
Rebase-Tested-11xx: compiled
"discover tasks" will generate a task_new for each task in the enclave.

This is a mechanism for live update: an agent can discover tasks that
were already in the enclave before it started handling messages.

Userspace agents that inherit from BasicDispatchScheduler() have their
own discovery mechanism based on the StatusWordTable: for each status
word, associate the task with the agent's channel.  We use association
to make sure we run exactly one TaskNew() for each task, including tasks
that arrive concurrently.

For agents that do not have a substantial userspace presence, such as
the biff scheduler (bpf-only), it's simpler to generate task_new
messages.  Your scheduler just needs to ignore task_new for tasks it
already knows about.
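
A sketch of the agent side of a handoff; the ctl path is illustrative,
and the "discover tasks" command string comes from this commit:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int discover_tasks(const char *ctl_path)  /* e.g. <enclave>/ctl */
{
    static const char cmd[] = "discover tasks";
    int fd = open(ctl_path, O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, cmd, strlen(cmd));
    close(fd);
    return n == (ssize_t)strlen(cmd) ? 0 : -1;
}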

Tested: agent_biff handoff
Effort: sched/ghost
Signed-off-by: Barret Rhoden <brho@google.com>
Change-Id: I70d796d11074b4cb7d638d8531cead9fe3ef0b2b
Origin-9xx-SHA1: 9dbec5a13eba0a8cce16bc2582f0191afb26aaa5
Rebase-Tested-11xx: compiled
Add #define GHOST_BPF to scripts/bpf_helpers_doc.py so that the
GHOST_BPF macro is generated into bpf_helper_defs.h, which makes it
visible to common.bpf.h (see cl/460745992). This allows us to guard the
magic numbers there with #ifndef GHOST_BPF.
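
The guard then looks something like this sketch (the fallback helper
declaration and ID are assumptions):

#include <linux/types.h>

#ifndef GHOST_BPF
/* bpf_helper_defs.h came from a non-ghost kernel: fall back to the
 * hand-coded magic numbers. */
static long (*bpf_ghost_wake_agent)(__u32 cpu) = (void *)3000;
#endif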

Tested: pushed this change to test branch on github repo, edited
WORKSPACE file locally in userspace repo and was able to build on GCE
instance
google@eb2fd69

Effort: sched/ghost
Change-Id: I9edc790956aa92f4ca157180891c7fd6727a99d6