Conversation

JakeHillion (Contributor):

Use the new `scx_bpf_dsq_peek` in scx_chaos to optimise for the fast path. This avoids locking the DSQs and should be beneficial in the common case where the DSQ is empty or nothing in it is ready yet.

Add a few stats for tracking how successful the peek is. On my local machine it works really well at skipping the slow path.

This mostly avoids contention with the crawling timer thread: the insertion in `chaos_enqueue` and removal in `chaos_dispatch` are local to one CPU, so the locking overhead there would be minimal anyway.
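(For illustration, a minimal sketch of what the peek-based fast path could look like. `get_cpu_delay_dsq` and the `goto p2dq` short-circuit come from the snippets quoted in the review below; `stat_inc` and the `CHAOS_STAT_PEEK_*` counters are hypothetical stand-ins for the new stats plumbing, corresponding to the empty/not_ready/needs_proc columns in the output further down.)

```
/* Sketch only; the actual chaos_dispatch() fast path may differ.
 * stat_inc() and the CHAOS_STAT_PEEK_* counters are hypothetical names. */
struct task_struct *first_p;
u64 now = bpf_ktime_get_ns();

first_p = scx_bpf_dsq_peek(get_cpu_delay_dsq(-1));
if (!first_p) {
	stat_inc(CHAOS_STAT_PEEK_EMPTY);	/* "empty" in the stats below */
	goto p2dq;				/* no delayed tasks on this CPU */
}
if (first_p->scx.dsq_vtime > now) {
	stat_inc(CHAOS_STAT_PEEK_NOT_READY);	/* "not_ready": head wakes later */
	goto p2dq;
}
stat_inc(CHAOS_STAT_PEEK_NEEDS_PROC);		/* "needs_proc": lock the DSQ and
						 * move the ready tasks for real */
```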

Test plan:

- CI

```
jake@merlin:/data/users/jake/repos/scx/ > cargo build --release -p scx_chaos && sudo target/release/scx_chaos --random-delay-frequency 0.01 --random-delay-min-us 100000 --random-delay-max-us 200000 --stats 10
...
    Finished `release` profile [optimized] target(s) in 1m 01s
11:28:59 [INFO] Running scx_chaos (build ID: 1.0.20-ga6134e95-dirty x86_64-unknown-linux-gnu)
11:28:59 [INFO] Builder { traits: [RandomDelays { frequency: 0.01, min_us: 100000, max_us: 200000 }], verbose: 0, kprobe_random_delays: None, p2dq_opts: SchedulerOpts { disable_kthreads_local: false, autoslice: false, interactive_ratio: 10, deadline: false, eager_load_balance: false, freq_control: false, greedy_idle_disable: true, interactive_sticky: false, interactive_fifo: false, dispatch_pick2_disable: false, dispatch_lb_busy: 75, dispatch_lb_interactive: true, keep_running: false, atq_enabled: false, cpu_priority: false, interactive_dsq: true, wakeup_lb_busy: 0, wakeup_llc_migrations: false, select_idle_in_enqueue: false, queued_wakeup: false, idle_resume_us: None, max_dsq_pick2: false, task_slice: false, min_slice_us: 100, lb_mode: Load, sched_mode: Default, lb_slack_factor: 5, min_llc_runs_pick2: 1, saturated_percent: 5, dsq_time_slices: [], dsq_shift: 4, llc_shards: 5, min_nr_queued_pick2: 0, dumb_queues: 3, init_dsq_index: 0, virt_llc_enabled: false, topo: TopologyArgs { virt_llc: None } }, requires_ppid: None }
11:28:59 [INFO] DSQ[0] slice_ns 100000
11:28:59 [INFO] DSQ[1] slice_ns 3200000
11:28:59 [INFO] DSQ[2] slice_ns 6400000
11:28:59 [WARN] libbpf: map 'chaos': BPF map skeleton link is uninitialized

chaos traits: random_delays/cpu_freq/degradation 0/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 0
peek: empty/not_ready/needs_proc 1057/0/0
chaos traits: random_delays/cpu_freq/degradation 3/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 3
peek: empty/not_ready/needs_proc 107168/309/9716
chaos traits: random_delays/cpu_freq/degradation 0/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 0
peek: empty/not_ready/needs_proc 91787/0/15417
^C11:29:23 [INFO] Unregister scx_chaos scheduler
```

@JakeHillion requested a review from rrnewton on October 15, 2025 10:30.
rrnewton (Contributor) left a comment:

Some comments but no blockers! Looks good.

```
scx_bpf_pick_any_cpu_node(cpus_allowed, node, flags) : \
scx_bpf_pick_any_cpu(cpus_allowed, flags))

#define __COMPAT_scx_bpf_dsq_peek(dsq_id) \
```
rrnewton (Contributor):

This is different from the compat macro (actually an inline function) in the V5 that we ended up with and that TJ moved to his branch. It probably doesn't matter, but it does make me curious what the rules are for updating this file.

Shouldn't it be automatically pulled from the kernel somewhere? I don't see any other scripts in the repo that mention compat.bpf.h -- I was expecting some update script that copies it from the kernel.

JakeHillion (Contributor, Author):

Ah, this has landed upstream now. I grabbed this from #2675 and plan to merge after that goes in; I'll take whatever's there (I don't actually need the compat macro).
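(For context, compat helpers in this repo typically gate on the kfunc being present via `bpf_ksym_exists()`; a hedged sketch of that pattern, not the exact V5 helper referenced above:)

```
/* Sketch of the usual compat pattern; the helper that actually landed may be
 * an inline function, and the NULL fallback here is only a placeholder -- a
 * real fallback would need to force the slow path on older kernels. */
#define __COMPAT_scx_bpf_dsq_peek(dsq_id)				\
	(bpf_ksym_exists(scx_bpf_dsq_peek) ?				\
	 scx_bpf_dsq_peek(dsq_id) : NULL)
```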

```
return U64_MAX;
}

first_p = bpf_task_from_pid(first_p->pid);
```
rrnewton (Contributor):

Maybe you could help me understand whether this is strictly required or just being extra cautious.

- We can access the field first_p->pid before the bpf_task_from_pid call because it is simple scalar data.
- The field first_p->scx.dsq_vtime is in the task_struct, inside the nested struct sched_ext_entity, which is stored contiguously within the parent struct (not via a pointer).

It seems to me that this effectively makes first_p->scx.dsq_vtime accessible without the bpf_task_from_pid/bpf_task_release protocol, because it's effectively scalar data. But I take it there's some rule about not referencing "complex data" without a proper reference-counted handle, to leave the implementation more leeway (on different architectures/compilers or something) with these nested structs?

JakeHillion (Contributor, Author):

Yes! This is good. I did try removing it before and had a verifier error, but I must have messed it up. Seems fine now on my kernel.
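(For readers following along, the acquire/read/release protocol under discussion looks roughly like this; a sketch assuming `first_p` came back from the peek, and whether the re-acquire is strictly needed is exactly the open question above:)

```
/* Sketch: take a counted reference before touching the peeked task,
 * read the nested sched_ext_entity field, then drop the reference. */
struct task_struct *p = bpf_task_from_pid(first_p->pid);
if (!p)
	return U64_MAX;			/* task exited between the peek and here */

u64 next_time = p->scx.dsq_vtime;	/* wake-up deadline of the delayed task */
bpf_task_release(p);
return next_time;
```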

```
bpf_for_each(scx_dsq, p, get_cpu_delay_dsq(-1), 0) {
// Check if we need to process the delay DSQ
if (delay_dsq_next_time(dsq_id) > now)
goto p2dq;
```
rrnewton (Contributor):

This makes sense as the short-circuit. I want to quickly make sure that the races are benign. Normally, we would expect a peek/vtime reading to be a lower bound -- other cores could race and asynchronously pop the task and increase the vtime after the observation we made (but not decrease it, because of monotonicity).

But here, because we only peek our own delay_dsq, maybe we don't even need to worry about that, since no one else will dispatch from it?

Do we need to rely on non-interference with the delay_dsq here? I think not: if the head of delay_dsq was popped and its head-delayed-time became even FURTHER into the future, then it is even more ahead of our fixed snapshot of now, and we remain justified in taking the fast-path goto p2dq.

Nevertheless, reasoning about the concurrency safety here seems a bit scary, because bpf_ktime_get_ns is real time that is constantly changing (monotonically forward; our "now" snapshots are immediately stale, and we don't get to control the forward march of time like in a simulation). If "now" and "earliest vtime in the delay queue" are both variables changing asynchronously in the background, then an atomic snapshot of just one of them can never guarantee that a moment in time exists where they have a particular ordering relationship.
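(To make the argument concrete, a hedged sketch of the check with the race reasoning spelled out in comments; names follow the quoted snippet, and the real delay_dsq_next_time() may differ:)

```
/* Sketch of a peek-based delay_dsq_next_time(); illustrative only. */
static __always_inline u64 delay_dsq_next_time(u64 dsq_id)
{
	struct task_struct *first_p = scx_bpf_dsq_peek(dsq_id);

	if (!first_p)
		return U64_MAX;		/* empty: nothing can be ready */

	/* Benign race, per the reasoning above: if the timer pops this head
	 * concurrently, the next head's deadline is >= this one, so a caller
	 * that already observed "deadline > now" stays justified in skipping. */
	return first_p->scx.dsq_vtime;
}
```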
