Conversation

hodgesds
Contributor

@hodgesds hodgesds commented Jun 25, 2025

Add an option to enable tickless scheduling on a layer. This may improve throughput in certain scenarios by reducing the number of context switches.

Not sure this is a good idea, but why not?
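For anyone skimming, the rough shape of the change is sketched below (this is a sketch, not the actual diff; the tickless flag on the layer and the helper around it are assumptions): tasks in a tickless layer get an effectively infinite slice, and a periodic timer preempts a layer's CPUs only when they have queued work, so context switches happen on demand instead of on every tick.

/* Hedged sketch of the per-layer tickless idea, not the actual patch. */

/* Give tasks in a tickless layer an "infinite" slice so the tick never
 * preempts them on its own. */
static void maybe_set_tickless_slice(struct task_struct *p, struct layer *layer)
{
        if (layer->tickless)
                p->scx.slice = SCX_SLICE_INF;
}

/* From a periodic BPF timer: preempt a tickless CPU only if it actually
 * has work waiting, either locally or on its layer/LLC DSQ. */
static void kick_if_work_pending(struct cpu_ctx *cpuc, u32 layer_id)
{
        u64 dsq_id = layer_dsq_id(layer_id, cpuc->llc_id);

        if (scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpuc->cpu) ||
            scx_bpf_dsq_nr_queued(dsq_id))
                scx_bpf_kick_cpu(cpuc->cpu, SCX_KICK_PREEMPT);
}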

bpftool prog trace

        schbench-3267991 [074] .Ns.. 242571.012335: bpf_trace_printk: TICKLESS kicking cpu 150 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012337: bpf_trace_printk: TICKLESS kicking cpu 151 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012339: bpf_trace_printk: TICKLESS kicking cpu 152 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012341: bpf_trace_printk: TICKLESS kicking cpu 153 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012344: bpf_trace_printk: TICKLESS kicking cpu 154 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012346: bpf_trace_printk: TICKLESS kicking cpu 155 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012348: bpf_trace_printk: TICKLESS kicking cpu 156 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012351: bpf_trace_printk: TICKLESS kicking cpu 157 on layer 0

Tested with schbench and performance seemed ok:

Wakeup Latencies percentiles (usec) runtime 30 (s) (324072 total samples)
          50.0th: 12         (97798 samples)
          90.0th: 29         (126658 samples)
        * 99.0th: 158        (28244 samples)
          99.9th: 358        (2901 samples)
          min=1, max=3437
Request Latencies percentiles (usec) runtime 30 (s) (324719 total samples)
          50.0th: 13296      (98116 samples)
          90.0th: 14672      (129660 samples)
        * 99.0th: 25120      (28649 samples)
          99.9th: 38720      (2908 samples)
          min=7319, max=84789
RPS percentiles (requests) runtime 30 (s) (25 total samples)
          20.0th: 12752      (5 samples)
        * 50.0th: 13040      (9 samples)
          90.0th: 13200      (11 samples)
          min=12664, max=13204
average rps: 13003.57

vs eevdf

Wakeup Latencies percentiles (usec) runtime 30 (s) (324608 total samples)
          50.0th: 8          (107233 samples)
          90.0th: 45         (86607 samples)
        * 99.0th: 1262       (28725 samples)
          99.9th: 3916       (2928 samples)
          min=1, max=9487
Request Latencies percentiles (usec) runtime 30 (s) (325386 total samples)
          50.0th: 13136      (94018 samples)
          90.0th: 14384      (130630 samples)
        * 99.0th: 24736      (28822 samples)
          99.9th: 38720      (2917 samples)
          min=6538, max=91121
RPS percentiles (requests) runtime 30 (s) (25 total samples)
          20.0th: 12944      (9 samples)
        * 50.0th: 13008      (10 samples)
          90.0th: 13104      (5 samples)
          min=12852, max=13184
average rps: 13005.63

Using the following config:

[
  {
    "name": "hodgesd",
    "comment": "hodgesd user",
    "matches": [
      [
        {
          "UIDEquals": 12345
        }
      ]
    ],
    "kind": {
      "Grouped": {
        "util_range": [
          0.25,
          0.6
        ],
        "tickless": true
      }
    }
  },
  {
    "name": "stress-ng",
    "comment": "stress-ng slice",
    "matches": [
      [
        {
          "CommPrefix": "stress-ng"
        }
      ],
      [
        {
          "PcommPrefix": "stress-ng"
        }
      ]
    ],
    "kind": {
      "Confined": {
        "util_range": [
          0.05,
          0.80
        ],
        "slice_us": 2000
      }
    }
  },
  {
    "name": "normal",
    "comment": "the rest",
    "matches": [
      []
    ],
    "kind": {
      "Open": {
        "slice_us": 800
      }
    }
  }
]

struct layered_timer layered_timers[MAX_TIMERS] = {
        {15LLU * NSEC_PER_SEC, CLOCK_BOOTTIME, 0},
        {1LLU * NSEC_PER_SEC, CLOCK_BOOTTIME, 0},
        {1LLU * NSEC_PER_MSEC, CLOCK_BOOTTIME, 0},

Contributor Author

This was chosen arbitrarily; it should probably be more configurable, but that makes working with timers a little annoying.
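One hedged way to make the interval configurable without reworking the timer table would be a rodata knob filled in from userspace before load; tickless_timer_ns below is a made-up name, not existing code:

/* Hypothetical knob, set from the Rust side via .rodata before load. */
const volatile u64 tickless_timer_ns = 1 * NSEC_PER_MSEC;

static int arm_tickless_timer(struct bpf_timer *timer)
{
        /* Re-arm with the configured interval instead of a hardcoded
         * table entry. */
        return bpf_timer_start(timer, tickless_timer_ns, 0);
}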

@hodgesds hodgesds requested a review from arighi June 25, 2025 20:00
@xerothermic
Contributor

I'm curious: if we boot the kernel with the proper tickless options + isolcpus, could we in theory run our selected user space tasks indefinitely? Or are there still some kernel threads that we'd need to yield to once in a while to avoid stalls?

        }

        bpf_for(layer_id, 0, nr_layers) {
                if (cpuc->layer_id != layer_id)

Contributor Author

cpu_ctx also has a task_layer_id field, but I think this is right because it's the CPU's "owned" layer.

@hodgesds
Contributor Author

I'm curious: if we boot the kernel with the proper tickless options + isolcpus, could we in theory run our selected user space tasks indefinitely? Or are there still some kernel threads that we'd need to yield to once in a while to avoid stalls?

I think that would be something that still needs to be tested, but IIRC affinitized kthreads can preempt.
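For reference, the host-side setup that question assumes is the usual full-tickless carve-out at boot; an illustrative command line (CPU ranges are machine dependent) looks like:

        nohz_full=2-15 rcu_nocbs=2-15 isolcpus=nohz,domain,2-15

Even with that, per-CPU kthreads can still need to run on the isolated CPUs now and then, which is the affinitized-kthread caveat above.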

Contributor

@arighi arighi left a comment

Left a few comments but LGTM

        return true;
}

static void tick_layer(struct cpu_ctx *cpuc, u32 layer_id)

Contributor

Maybe preempt_layer() to make it clear that it's not a tick callback?

Contributor

tickless_kick_layer()?

{
        u64 dsq_id = layer_dsq_id(layer_id, cpuc->llc_id);

        if (!scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpuc->cpu) &&
            !scx_bpf_dsq_nr_queued(dsq_id))

Contributor

We may also want to skip the kick if the CPU is idle or if the task running on it hasn't used enough time slice, but the code is simpler as it is and it's probably fine.
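If we wanted the stricter version, a sketch of the extra check might look like the following (cpuc->running_at and min_run_ns are hypothetical, not existing scx_layered fields):

/* Sketch only: besides the existing emptiness check, also skip the kick
 * when the currently running task hasn't used enough of its slice yet. */
static bool worth_kicking(struct cpu_ctx *cpuc, u64 now, u64 min_run_ns)
{
        u64 dsq_id = layer_dsq_id(cpuc->layer_id, cpuc->llc_id);

        if (!scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpuc->cpu) &&
            !scx_bpf_dsq_nr_queued(dsq_id))
                return false;   /* nothing waiting, same as the current check */

        if (now - cpuc->running_at < min_run_ns)
                return false;   /* current task hasn't run long enough yet */

        return true;
}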

Contributor

Shouldn't it also consider fallback DSQs?
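Presumably that would just mean adding the fallback queue to the emptiness check, roughly as below (llc_fallback_dsq_id is a placeholder for whatever helper scx_layered actually uses for its fallback DSQs):

        u64 dsq_id = layer_dsq_id(layer_id, cpuc->llc_id);
        u64 fb_dsq_id = llc_fallback_dsq_id(cpuc->llc_id);      /* placeholder name */

        /* Only treat the CPU as having nothing runnable when the local,
         * layer, and fallback DSQs are all empty. */
        if (!scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpuc->cpu) &&
            !scx_bpf_dsq_nr_queued(dsq_id) &&
            !scx_bpf_dsq_nr_queued(fb_dsq_id))
                return;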


        if (!enable_antistall)
                return true;
        return false;

Contributor

Should this be a separate patch?

Contributor

Ahh @hodgesds, this change is a noop, right? i.e. that's just the return to the timer callback, which does nothing IIUC.

Contributor Author

Yeah, I can make it a separate patch... it prevents the timer from rerunning if antistall is disabled.
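So the boolean is what the shared timer path uses to decide whether to re-arm; roughly this pattern (a sketch of the shape with an illustrative interval, not the exact wrapper code):

static u64 antistall_interval_ns = NSEC_PER_SEC;        /* illustrative */

static bool run_antistall(void)
{
        /* Returning true tells the caller to stop re-arming the timer. */
        if (!enable_antistall)
                return true;
        /* ... antistall scan ... */
        return false;
}

static int antistall_timer_cb(void *map, int *key, struct bpf_timer *timer)
{
        if (!run_antistall())
                bpf_timer_start(timer, antistall_interval_ns, 0);
        return 0;
}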

Contributor

@likewhatevs likewhatevs Jun 26, 2025

suspect the option to disable antistall ought to be removed btw

  • also lol thx for letting me know that wasn't noop, thought return true was just verifier noise.

Contributor Author

I refactored the timer interface in #2266. If that makes sense then I'll use that to make the tickless interval configurable.

Contributor

suspect the option to disable antistall ought to be removed btw

* also lol thx for letting me know that wasn't noop, thought return true was just verifier noise.

Let's not do that. Disabling antistall is very useful when validating that a new scheduler release behaves, as the kicks are much better than tanking performance on a machine by 90% (or whatever) but staying running.

@likewhatevs likewhatevs requested a review from kkdwivedi June 25, 2025 23:53
@hodgesds hodgesds force-pushed the layered-tickless branch 4 times, most recently from f0b2522 to 7f2bc90 Compare June 27, 2025 08:05
@hodgesds
Contributor Author

Rebased with the timer changes and borrowed some of the configuration from scx_tickless. This implementation might be a bit aggressive with kicking, but that's probably a safer way of doing things.

@hodgesds hodgesds force-pushed the layered-tickless branch 2 times, most recently from 9bc93be to 759119b Compare July 2, 2025 16:46
@hodgesds
Contributor Author

hodgesds commented Jul 2, 2025

Updated to use @arighi's approach of preempting the tickless task during enqueue; no stalls when running with stress-ng to test for stalls:

 stress-ng -t 45 --aggressive -M -c `nproc` --affinity-sleep 1 --affinity-rand --affinity 100 --affinity-delay 1 --affinity-pin
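Roughly, the enqueue-time variant means the kick happens when new work shows up rather than from the timer; a sketch of the shape (lookup_cpu_ctx/lookup_layer and the tickless flag placement are assumptions, not the actual diff):

/* Sketch: when a task is enqueued toward a CPU whose current task is
 * running ticklessly (effectively infinite slice), preempt it right
 * away instead of waiting for the periodic timer. */
static void maybe_preempt_tickless_cpu(s32 cpu)
{
        struct cpu_ctx *cpuc = lookup_cpu_ctx(cpu);     /* helper name assumed */
        struct layer *layer;

        if (!cpuc)
                return;

        layer = lookup_layer(cpuc->layer_id);           /* helper name assumed */
        if (layer && layer->tickless)
                scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
}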

perf tests:

Wakeup Latencies percentiles (usec) runtime 30 (s) (392439 total samples)
         50.0th: 13         (99737 samples)
         90.0th: 32         (157179 samples)
       * 99.0th: 195        (34664 samples)
         99.9th: 472        (3497 samples)
         min=1, max=21448
Request Latencies percentiles (usec) runtime 30 (s) (393086 total samples)
         50.0th: 13520      (114040 samples)
         90.0th: 14416      (158790 samples)
       * 99.0th: 15440      (33131 samples)
         99.9th: 26208      (3427 samples)
         min=7006, max=49908
RPS percentiles (requests) runtime 30 (s) (31 total samples)
         20.0th: 13072      (9 samples)
       * 50.0th: 13136      (17 samples)
         90.0th: 13168      (2 samples)
         min=12640, max=13210
average rps: 13102.87

Running without tickless shows similar performance:

Wakeup Latencies percentiles (usec) runtime 30 (s) (393181 total samples)
          50.0th: 12         (107217 samples)
          90.0th: 28         (151138 samples)
        * 99.0th: 149        (34415 samples)
          99.9th: 406        (3535 samples)
          min=1, max=6449
Request Latencies percentiles (usec) runtime 30 (s) (393779 total samples)
          50.0th: 13616      (114923 samples)
          90.0th: 14416      (157620 samples)
        * 99.0th: 15184      (35127 samples)
          99.9th: 16544      (3371 samples)
          min=7385, max=28919
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 13072      (7 samples)
        * 50.0th: 13104      (11 samples)
          90.0th: 13200      (13 samples)
          min=12986, max=13197
average rps: 13125.97

vs eevdf:

Wakeup Latencies percentiles (usec) runtime 30 (s) (386557 total samples)
          50.0th: 8          (126273 samples)
          90.0th: 46         (103150 samples)
        * 99.0th: 2042       (34255 samples)
          99.9th: 3964       (3478 samples)
          min=1, max=14768
Request Latencies percentiles (usec) runtime 30 (s) (387364 total samples)
          50.0th: 13136      (115963 samples)
          90.0th: 14480      (155486 samples)
        * 99.0th: 27680      (33933 samples)
          99.9th: 42560      (3456 samples)
          min=6537, max=89863
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 12752      (7 samples)
        * 50.0th: 12912      (10 samples)
          90.0th: 13040      (12 samples)
          min=12551, max=13088
average rps: 12912.13

@hodgesds hodgesds force-pushed the layered-tickless branch 2 times, most recently from 9e086a7 to 4e88203 Compare July 2, 2025 16:53
Add an option to enable tickless scheduling on a layer. This may improve
throughput in certain scenarios by reducing the number of context
switches.

Signed-off-by: Daniel Hodges <hodgesd@meta.com>
@hodgesds hodgesds force-pushed the layered-tickless branch from 4e88203 to 53e3b17 Compare July 2, 2025 17:07