Conversation

JakeHillion (Contributor):

Use the new `scx_bpf_dsq_peek` in scx_chaos to optimise for the fast path. This avoids locking the DSQs and should be beneficial in the common case where the DSQ is empty or nothing in it is ready yet.

Add a few stats for tracking how successful the peek is. On my local machine it works really well at skipping the slow path.

This mostly avoids contention with the crawling timer thread: the insertion in `chaos_enqueue` and removal in `chaos_dispatch` are local to one CPU, so the locking overhead there would be minimal anyway.
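(For illustration, a minimal sketch of what the peek-based fast path could look like. `get_cpu_delay_dsq` and the `goto p2dq` short-circuit come from the snippets quoted in the review below; `stat_inc` and the `CHAOS_STAT_PEEK_*` counters are hypothetical stand-ins for the new stats plumbing, corresponding to the empty/not_ready/needs_proc columns in the output further down.)

```
/* Sketch only; the actual chaos_dispatch() fast path may differ.
 * stat_inc() and the CHAOS_STAT_PEEK_* counters are hypothetical names. */
struct task_struct *first_p;
u64 now = bpf_ktime_get_ns();

first_p = scx_bpf_dsq_peek(get_cpu_delay_dsq(-1));
if (!first_p) {
	stat_inc(CHAOS_STAT_PEEK_EMPTY);	/* "empty" in the stats below */
	goto p2dq;				/* no delayed tasks on this CPU */
}
if (first_p->scx.dsq_vtime > now) {
	stat_inc(CHAOS_STAT_PEEK_NOT_READY);	/* "not_ready": head wakes later */
	goto p2dq;
}
stat_inc(CHAOS_STAT_PEEK_NEEDS_PROC);		/* "needs_proc": lock the DSQ and
						 * move the ready tasks for real */
```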

Test plan:

- CI

```
jake@merlin:/data/users/jake/repos/scx/ > cargo build --release -p scx_chaos && sudo target/release/scx_chaos --random-delay-frequency 0.01 --random-delay-min-us 100000 --random-delay-max-us 200000 --stats 10
...
    Finished `release` profile [optimized] target(s) in 1m 01s
11:28:59 [INFO] Running scx_chaos (build ID: 1.0.20-ga6134e95-dirty x86_64-unknown-linux-gnu)
11:28:59 [INFO] Builder { traits: [RandomDelays { frequency: 0.01, min_us: 100000, max_us: 200000 }], verbose: 0, kprobe_random_delays: None, p2dq_opts: SchedulerOpts { disable_kthreads_local: false, autoslice: false, interactive_ratio: 10, deadline: false, eager_load_balance: false, freq_control: false, greedy_idle_disable: true, interactive_sticky: false, interactive_fifo: false, dispatch_pick2_disable: false, dispatch_lb_busy: 75, dispatch_lb_interactive: true, keep_running: false, atq_enabled: false, cpu_priority: false, interactive_dsq: true, wakeup_lb_busy: 0, wakeup_llc_migrations: false, select_idle_in_enqueue: false, queued_wakeup: false, idle_resume_us: None, max_dsq_pick2: false, task_slice: false, min_slice_us: 100, lb_mode: Load, sched_mode: Default, lb_slack_factor: 5, min_llc_runs_pick2: 1, saturated_percent: 5, dsq_time_slices: [], dsq_shift: 4, llc_shards: 5, min_nr_queued_pick2: 0, dumb_queues: 3, init_dsq_index: 0, virt_llc_enabled: false, topo: TopologyArgs { virt_llc: None } }, requires_ppid: None }
11:28:59 [INFO] DSQ[0] slice_ns 100000
11:28:59 [INFO] DSQ[1] slice_ns 3200000
11:28:59 [INFO] DSQ[2] slice_ns 6400000
11:28:59 [WARN] libbpf: map 'chaos': BPF map skeleton link is uninitialized

chaos traits: random_delays/cpu_freq/degradation 0/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 0
peek: empty/not_ready/needs_proc 1057/0/0
chaos traits: random_delays/cpu_freq/degradation 3/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 3
peek: empty/not_ready/needs_proc 107168/309/9716
chaos traits: random_delays/cpu_freq/degradation 0/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 0
peek: empty/not_ready/needs_proc 91787/0/15417
^C11:29:23 [INFO] Unregister scx_chaos scheduler
```

@JakeHillion requested a review from rrnewton on October 15, 2025 10:30.
rrnewton (Contributor) left a comment:

Some comments but no blockers! Looks good.

```
scx_bpf_pick_any_cpu_node(cpus_allowed, node, flags) : \
scx_bpf_pick_any_cpu(cpus_allowed, flags))

#define __COMPAT_scx_bpf_dsq_peek(dsq_id) \
```
rrnewton (Contributor):

This is different from the compat macro (actually an inline function) in the V5 that we ended up with and that TJ moved to his branch. It probably doesn't matter, but it does make me curious what the rules are for updating this file.

Shouldn't it be automatically pulled from the kernel somewhere? I don't see any other scripts in the repo that mention compat.bpf.h -- I was expecting some update script that copies it from the kernel.

JakeHillion (Contributor, Author):

Ah, this has landed upstream now. I grabbed this from #2675 and plan to merge after that goes in; I'll take whatever's there (I don't actually need the compat macro).
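(For context, compat helpers in this repo typically gate on the kfunc being present via `bpf_ksym_exists()`; a hedged sketch of that pattern, not the exact V5 helper referenced above:)

```
/* Sketch of the usual compat pattern; the helper that actually landed may be
 * an inline function, and the NULL fallback here is only a placeholder -- a
 * real fallback would need to force the slow path on older kernels. */
#define __COMPAT_scx_bpf_dsq_peek(dsq_id)				\
	(bpf_ksym_exists(scx_bpf_dsq_peek) ?				\
	 scx_bpf_dsq_peek(dsq_id) : NULL)
```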

```
return U64_MAX;
}

first_p = bpf_task_from_pid(first_p->pid);
```
rrnewton (Contributor):

Maybe you could help me understand whether this is strictly required or just being extra cautious.

- We can access the field first_p->pid before the bpf_task_from_pid call because it is simple scalar data.
- The field first_p->scx.dsq_vtime is in the task_struct, inside the nested struct sched_ext_entity, which is stored contiguously within the parent struct (not via a pointer).

It seems to me that this effectively makes first_p->scx.dsq_vtime accessible without the bpf_task_from_pid/bpf_task_release protocol, because it's effectively scalar data. But I take it there's some rule about not referencing "complex data" without a proper reference-counted handle, to leave the implementation more leeway (on different architectures/compilers or something) with these nested structs?

JakeHillion (Contributor, Author):

Yes! This is good. I did try removing it before and had a verifier error, but I must have messed it up. Seems fine now on my kernel.
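(For readers following along, the acquire/read/release protocol under discussion looks roughly like this; a sketch assuming `first_p` came back from the peek, and whether the re-acquire is strictly needed is exactly the open question above:)

```
/* Sketch: take a counted reference before touching the peeked task,
 * read the nested sched_ext_entity field, then drop the reference. */
struct task_struct *p = bpf_task_from_pid(first_p->pid);
if (!p)
	return U64_MAX;			/* task exited between the peek and here */

u64 next_time = p->scx.dsq_vtime;	/* wake-up deadline of the delayed task */
bpf_task_release(p);
return next_time;
```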

```
bpf_for_each(scx_dsq, p, get_cpu_delay_dsq(-1), 0) {
// Check if we need to process the delay DSQ
if (delay_dsq_next_time(dsq_id) > now)
goto p2dq;
```
rrnewton (Contributor):

This makes sense as the short-circuit. I want to quickly make sure that the races are benign. Normally, we would expect a peek/vtime reading to be a lower bound -- other cores could race and asynchronously pop the task and increase the vtime after the observation we made (but not decrease it, because of monotonicity).

But here, because we only peek our own delay_dsq, maybe we don't even need to worry about that, since no one else will dispatch from it?

Do we need to rely on non-interference with the delay_dsq here? I think not: if the head of delay_dsq was popped and its head-delayed-time became even FURTHER into the future, then it is even more ahead of our fixed snapshot of now, and we remain justified in taking the fast-path goto p2dq.

Nevertheless, reasoning about the concurrency safety here seems a bit scary, because bpf_ktime_get_ns is real time that is constantly changing (monotonically forward; our "now" snapshots are immediately stale, and we don't get to control the forward march of time like in a simulation). If "now" and "earliest vtime in the delay queue" are both variables changing asynchronously in the background, then an atomic snapshot of just one of them can never guarantee that a moment in time exists where they have a particular ordering relationship.
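(To make the argument concrete, a hedged sketch of the check with the race reasoning spelled out in comments; names follow the quoted snippet, and the real delay_dsq_next_time() may differ:)

```
/* Sketch of a peek-based delay_dsq_next_time(); illustrative only. */
static __always_inline u64 delay_dsq_next_time(u64 dsq_id)
{
	struct task_struct *first_p = scx_bpf_dsq_peek(dsq_id);

	if (!first_p)
		return U64_MAX;		/* empty: nothing can be ready */

	/* Benign race, per the reasoning above: if the timer pops this head
	 * concurrently, the next head's deadline is >= this one, so a caller
	 * that already observed "deadline > now" stays justified in skipping. */
	return first_p->scx.dsq_vtime;
}
```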
