Conversation

hodgesds
Contributor

@hodgesds hodgesds commented Jun 25, 2025

Add an option to enable tickless scheduling on a layer. This may improve throughput in certain scenarios by reducing the number of context switches.

Not sure this is a good idea, but why not?
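For anyone skimming, the rough shape of the change is sketched below (this is a sketch, not the actual diff; the tickless flag on the layer and the helper around it are assumptions): tasks in a tickless layer get an effectively infinite slice, and a periodic timer preempts a layer's CPUs only when they have queued work, so context switches happen on demand instead of on every tick.

/* Hedged sketch of the per-layer tickless idea, not the actual patch. */

/* Give tasks in a tickless layer an "infinite" slice so the tick never
 * preempts them on its own. */
static void maybe_set_tickless_slice(struct task_struct *p, struct layer *layer)
{
        if (layer->tickless)
                p->scx.slice = SCX_SLICE_INF;
}

/* From a periodic BPF timer: preempt a tickless CPU only if it actually
 * has work waiting, either locally or on its layer/LLC DSQ. */
static void kick_if_work_pending(struct cpu_ctx *cpuc, u32 layer_id)
{
        u64 dsq_id = layer_dsq_id(layer_id, cpuc->llc_id);

        if (scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpuc->cpu) ||
            scx_bpf_dsq_nr_queued(dsq_id))
                scx_bpf_kick_cpu(cpuc->cpu, SCX_KICK_PREEMPT);
}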

bpftool prog trace

        schbench-3267991 [074] .Ns.. 242571.012335: bpf_trace_printk: TICKLESS kicking cpu 150 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012337: bpf_trace_printk: TICKLESS kicking cpu 151 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012339: bpf_trace_printk: TICKLESS kicking cpu 152 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012341: bpf_trace_printk: TICKLESS kicking cpu 153 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012344: bpf_trace_printk: TICKLESS kicking cpu 154 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012346: bpf_trace_printk: TICKLESS kicking cpu 155 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012348: bpf_trace_printk: TICKLESS kicking cpu 156 on layer 0
        schbench-3267991 [074] .Ns.. 242571.012351: bpf_trace_printk: TICKLESS kicking cpu 157 on layer 0

Tested with schbench and performance seemed ok:

Wakeup Latencies percentiles (usec) runtime 30 (s) (324072 total samples)
          50.0th: 12         (97798 samples)
          90.0th: 29         (126658 samples)
        * 99.0th: 158        (28244 samples)
          99.9th: 358        (2901 samples)
          min=1, max=3437
Request Latencies percentiles (usec) runtime 30 (s) (324719 total samples)
          50.0th: 13296      (98116 samples)
          90.0th: 14672      (129660 samples)
        * 99.0th: 25120      (28649 samples)
          99.9th: 38720      (2908 samples)
          min=7319, max=84789
RPS percentiles (requests) runtime 30 (s) (25 total samples)
          20.0th: 12752      (5 samples)
        * 50.0th: 13040      (9 samples)
          90.0th: 13200      (11 samples)
          min=12664, max=13204
average rps: 13003.57

vs eevdf

Wakeup Latencies percentiles (usec) runtime 30 (s) (324608 total samples)
          50.0th: 8          (107233 samples)
          90.0th: 45         (86607 samples)
        * 99.0th: 1262       (28725 samples)
          99.9th: 3916       (2928 samples)
          min=1, max=9487
Request Latencies percentiles (usec) runtime 30 (s) (325386 total samples)
          50.0th: 13136      (94018 samples)
          90.0th: 14384      (130630 samples)
        * 99.0th: 24736      (28822 samples)
          99.9th: 38720      (2917 samples)
          min=6538, max=91121
RPS percentiles (requests) runtime 30 (s) (25 total samples)
          20.0th: 12944      (9 samples)
        * 50.0th: 13008      (10 samples)
          90.0th: 13104      (5 samples)
          min=12852, max=13184
average rps: 13005.63

Using the following config:

[
  {
    "name": "hodgesd",
    "comment": "hodgesd user",
    "matches": [
      [
        {
          "UIDEquals": 12345
        }
      ]
    ],
    "kind": {
      "Grouped": {
        "util_range": [
          0.25,
          0.6
        ],
        "tickless": true
      }
    }
  },
  {
    "name": "stress-ng",
    "comment": "stress-ng slice",
    "matches": [
      [
        {
          "CommPrefix": "stress-ng"
        }
      ],
      [
        {
          "PcommPrefix": "stress-ng"
        }
      ]
    ],
    "kind": {
      "Confined": {
        "util_range": [
          0.05,
          0.80
        ],
        "slice_us": 2000
      }
    }
  },
  {
    "name": "normal",
    "comment": "the rest",
    "matches": [
      []
    ],
    "kind": {
      "Open": {
        "slice_us": 800
      }
    }
  }
]

struct layered_timer layered_timers[MAX_TIMERS] = {
        {15LLU * NSEC_PER_SEC, CLOCK_BOOTTIME, 0},
        {1LLU * NSEC_PER_SEC, CLOCK_BOOTTIME, 0},
        {1LLU * NSEC_PER_MSEC, CLOCK_BOOTTIME, 0},

Contributor Author

This was chosen arbitrarily; it should probably be more configurable, but that makes working with timers a little annoying.
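One hedged way to make the interval configurable without reworking the timer table would be a rodata knob filled in from userspace before load; tickless_timer_ns below is a made-up name, not existing code:

/* Hypothetical knob, set from the Rust side via .rodata before load. */
const volatile u64 tickless_timer_ns = 1 * NSEC_PER_MSEC;

static int arm_tickless_timer(struct bpf_timer *timer)
{
        /* Re-arm with the configured interval instead of a hardcoded
         * table entry. */
        return bpf_timer_start(timer, tickless_timer_ns, 0);
}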

@hodgesds hodgesds requested a review from arighi June 25, 2025 20:00
@xerothermic
Contributor

I'm curious: if we boot the kernel with the proper tickless options + isolcpus, could we in theory run our selected user space tasks indefinitely? Or are there still some kernel threads that we'd need to yield to once in a while to avoid stalls?

        }

        bpf_for(layer_id, 0, nr_layers) {
                if (cpuc->layer_id != layer_id)

Contributor Author

cpu_ctx also has a task_layer_id field, but I think this is right because it's the CPU's "owned" layer.

@hodgesds
Contributor Author

I'm curious: if we boot the kernel with the proper tickless options + isolcpus, could we in theory run our selected user space tasks indefinitely? Or are there still some kernel threads that we'd need to yield to once in a while to avoid stalls?

I think that would be something that still needs to be tested, but IIRC affinitized kthreads can preempt.
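For reference, the host-side setup that question assumes is the usual full-tickless carve-out at boot; an illustrative command line (CPU ranges are machine dependent) looks like:

        nohz_full=2-15 rcu_nocbs=2-15 isolcpus=nohz,domain,2-15

Even with that, per-CPU kthreads can still need to run on the isolated CPUs now and then, which is the affinitized-kthread caveat above.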

Contributor

@arighi arighi left a comment

Left a few comments but LGTM

        return true;
}

static void tick_layer(struct cpu_ctx *cpuc, u32 layer_id)

Contributor

Maybe preempt_layer() to make it clear that it's not a tick callback?

Contributor

tickless_kick_layer()?

{
        u64 dsq_id = layer_dsq_id(layer_id, cpuc->llc_id);

        if (!scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpuc->cpu) &&
            !scx_bpf_dsq_nr_queued(dsq_id))

Contributor

We may also want to skip the kick if the CPU is idle or if the task running on it hasn't used enough time slice, but the code is simpler as it is and it's probably fine.
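If we wanted the stricter version, a sketch of the extra check might look like the following (cpuc->running_at and min_run_ns are hypothetical, not existing scx_layered fields):

/* Sketch only: besides the existing emptiness check, also skip the kick
 * when the currently running task hasn't used enough of its slice yet. */
static bool worth_kicking(struct cpu_ctx *cpuc, u64 now, u64 min_run_ns)
{
        u64 dsq_id = layer_dsq_id(cpuc->layer_id, cpuc->llc_id);

        if (!scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpuc->cpu) &&
            !scx_bpf_dsq_nr_queued(dsq_id))
                return false;   /* nothing waiting, same as the current check */

        if (now - cpuc->running_at < min_run_ns)
                return false;   /* current task hasn't run long enough yet */

        return true;
}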

Contributor

Shouldn't it also consider fallback DSQs?
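Presumably that would just mean adding the fallback queue to the emptiness check, roughly as below (llc_fallback_dsq_id is a placeholder for whatever helper scx_layered actually uses for its fallback DSQs):

        u64 dsq_id = layer_dsq_id(layer_id, cpuc->llc_id);
        u64 fb_dsq_id = llc_fallback_dsq_id(cpuc->llc_id);      /* placeholder name */

        /* Only treat the CPU as having nothing runnable when the local,
         * layer, and fallback DSQs are all empty. */
        if (!scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpuc->cpu) &&
            !scx_bpf_dsq_nr_queued(dsq_id) &&
            !scx_bpf_dsq_nr_queued(fb_dsq_id))
                return;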


        if (!enable_antistall)
                return true;
        return false;

Contributor

Should this be a separate patch?

Contributor

Ahh @hodgesds, this change is a noop, right? i.e. that's just the return to the timer callback, which does nothing IIUC.

Contributor Author

Yeah, I can make it a separate patch... it prevents the timer from rerunning if antistall is disabled.
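So the boolean is what the shared timer path uses to decide whether to re-arm; roughly this pattern (a sketch of the shape with an illustrative interval, not the exact wrapper code):

static u64 antistall_interval_ns = NSEC_PER_SEC;        /* illustrative */

static bool run_antistall(void)
{
        /* Returning true tells the caller to stop re-arming the timer. */
        if (!enable_antistall)
                return true;
        /* ... antistall scan ... */
        return false;
}

static int antistall_timer_cb(void *map, int *key, struct bpf_timer *timer)
{
        if (!run_antistall())
                bpf_timer_start(timer, antistall_interval_ns, 0);
        return 0;
}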

Contributor

@likewhatevs likewhatevs Jun 26, 2025

suspect the option to disable antistall ought to be removed btw

  • also lol thx for letting me know that wasn't noop, thought return true was just verifier noise.

Contributor Author

I refactored the timer interface in #2266. If that makes sense then I'll use that to make the tickless interval configurable.

Contributor

suspect the option to disable antistall ought to be removed btw

* also lol thx for letting me know that wasn't noop, thought return true was just verifier noise.

Let's not do that. Disabling antistall is very useful when validating that a new scheduler release behaves, as the kicks are much better than tanking performance on a machine by 90% (or whatever) but staying running.

@likewhatevs likewhatevs requested a review from kkdwivedi June 25, 2025 23:53
@hodgesds hodgesds force-pushed the layered-tickless branch 4 times, most recently from f0b2522 to 7f2bc90 Compare June 27, 2025 08:05
@hodgesds
Contributor Author

Rebased with the timer changes and borrowed some of the configuration from scx_tickless. This implementation might be a bit aggressive with kicking, but that's probably a safer way of doing things.

@hodgesds hodgesds force-pushed the layered-tickless branch 2 times, most recently from 9bc93be to 759119b Compare July 2, 2025 16:46
@hodgesds
Contributor Author

hodgesds commented Jul 2, 2025

Updated to use @arighi's approach of preempting the tickless task during enqueue; no stalls when running with stress-ng to test for stalls:

 stress-ng -t 45 --aggressive -M -c `nproc` --affinity-sleep 1 --affinity-rand --affinity 100 --affinity-delay 1 --affinity-pin
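Roughly, the enqueue-time variant means the kick happens when new work shows up rather than from the timer; a sketch of the shape (lookup_cpu_ctx/lookup_layer and the tickless flag placement are assumptions, not the actual diff):

/* Sketch: when a task is enqueued toward a CPU whose current task is
 * running ticklessly (effectively infinite slice), preempt it right
 * away instead of waiting for the periodic timer. */
static void maybe_preempt_tickless_cpu(s32 cpu)
{
        struct cpu_ctx *cpuc = lookup_cpu_ctx(cpu);     /* helper name assumed */
        struct layer *layer;

        if (!cpuc)
                return;

        layer = lookup_layer(cpuc->layer_id);           /* helper name assumed */
        if (layer && layer->tickless)
                scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
}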

perf tests:

Wakeup Latencies percentiles (usec) runtime 30 (s) (392439 total samples)
         50.0th: 13         (99737 samples)
         90.0th: 32         (157179 samples)
       * 99.0th: 195        (34664 samples)
         99.9th: 472        (3497 samples)
         min=1, max=21448
Request Latencies percentiles (usec) runtime 30 (s) (393086 total samples)
         50.0th: 13520      (114040 samples)
         90.0th: 14416      (158790 samples)
       * 99.0th: 15440      (33131 samples)
         99.9th: 26208      (3427 samples)
         min=7006, max=49908
RPS percentiles (requests) runtime 30 (s) (31 total samples)
         20.0th: 13072      (9 samples)
       * 50.0th: 13136      (17 samples)
         90.0th: 13168      (2 samples)
         min=12640, max=13210
average rps: 13102.87

Running without tickless shows similar performance:

Wakeup Latencies percentiles (usec) runtime 30 (s) (393181 total samples)
          50.0th: 12         (107217 samples)
          90.0th: 28         (151138 samples)
        * 99.0th: 149        (34415 samples)
          99.9th: 406        (3535 samples)
          min=1, max=6449
Request Latencies percentiles (usec) runtime 30 (s) (393779 total samples)
          50.0th: 13616      (114923 samples)
          90.0th: 14416      (157620 samples)
        * 99.0th: 15184      (35127 samples)
          99.9th: 16544      (3371 samples)
          min=7385, max=28919
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 13072      (7 samples)
        * 50.0th: 13104      (11 samples)
          90.0th: 13200      (13 samples)
          min=12986, max=13197
average rps: 13125.97

vs eevdf:

Wakeup Latencies percentiles (usec) runtime 30 (s) (386557 total samples)
          50.0th: 8          (126273 samples)
          90.0th: 46         (103150 samples)
        * 99.0th: 2042       (34255 samples)
          99.9th: 3964       (3478 samples)
          min=1, max=14768
Request Latencies percentiles (usec) runtime 30 (s) (387364 total samples)
          50.0th: 13136      (115963 samples)
          90.0th: 14480      (155486 samples)
        * 99.0th: 27680      (33933 samples)
          99.9th: 42560      (3456 samples)
          min=6537, max=89863
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 12752      (7 samples)
        * 50.0th: 12912      (10 samples)
          90.0th: 13040      (12 samples)
          min=12551, max=13088
average rps: 12912.13

@hodgesds hodgesds force-pushed the layered-tickless branch 2 times, most recently from 9e086a7 to 4e88203 Compare July 2, 2025 16:53
Add an option to enable tickless scheduling on a layer. This may improve
throughput in certain scenarios by reducing the number of context
switches.

Signed-off-by: Daniel Hodges <hodgesd@meta.com>
@hodgesds hodgesds force-pushed the layered-tickless branch from 4e88203 to 53e3b17 Compare July 2, 2025 17:07