killown edited this page Apr 1, 2026 · 2 revisions

Software Phase-Locked Loop - Technical Wiki

Relevant source file: src/pll.rs
Enabled by: --pll CLI flag
Best used with: --mode mailbox or --mode immediate


Table of Contents

  1. Goal
  2. Background - Why Naive Rendering Fails
  3. Architecture Overview
  4. The Vblank Grid
  5. Hardware Phase Anchoring
  6. Render Budget Estimation
  7. Deadline Computation
  8. The PI Controller
  9. Lock State Machine
  10. Sleeping - clock_nanosleep(TIMER_ABSTIME)
  11. Dropped Frame Handling
  12. Sync Score and Phase Drift
  13. Tuning Constants Reference
  14. Data Flow Per Frame
  15. Present Mode Compatibility
  16. Frame Log Fields
  17. Limitations

1. Goal

Lock the render loop to exactly one frame per monitor vblank period and attempt to deliver each frame just-in-time (JIT). The architecture uses a predictive PI controller to shift the submission phase as close to the vblank edge as the system allows.

The "50% Phase" Bottleneck

Most Linux compositors are hardcoded to target a phase offset of ~50% of the vblank period. This is a safety guardband: if you deliver a frame "too late" (e.g., at 90% phase), the compositor may reject it for the current cycle and delay it by an entire frame.

In these environments, the PLL cannot "force" a sync_score of 100. Instead, it serves to:

  1. Identify the Guardband: Find the exact point where the compositor begins to drop or delay frames.
  2. Expose the Latency Tax: Measure the difference between the hardware vblank and the earliest phase the compositor is willing to accept.
  3. Maximise Stability: Lock to the highest stable phase allowed, eliminating the random jitter of uncapped rendering.

Measurable Outcomes

| Metric | Without --pll | With --pll (Locked) |
|---|---|---|
| FPS | Uncapped (GPU-limited) | Locked to refresh rate |
| sync_score | ~48 (random/default) | Max stable value allowed by the compositor |
| phase_drift_ms | ±half-period | Minimal/deterministic |
| flip_latency_ms | High/unpredictable | Stable (reveals compositor overhead) |

2. Background - Why Naive Rendering Fails

A display scans out one frame per vblank period T. Without rate limiting, a GPU that renders a frame in 3 ms submits close to three frames per 8.3 ms period at 120 Hz. The compositor queues them and presents one per vblank regardless, so the FPS seen by the application does not match what the user actually sees.

Even with vsync (PresentMode::Fifo), the phase of submission is uncontrolled. The frame may be submitted with 7 ms to spare before the vblank or with 0.1 ms; the compositor holds it in a buffer and presents it at the next vblank either way. This means sync_score is essentially random: it measures where the finished frame happened to arrive relative to the hardware clock, and without deliberate phase control that arrival point is arbitrary, averaging a sync_score near 50.

The PLL fixes both problems: it limits rate and controls phase.


3. Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        render() loop                            │
│                                                                 │
│  ┌──────────────┐   flip_ns    ┌──────────────────────────┐     │
│  │ FlipTracker  │ ──────────►  │     PacingAnalyzer       │     │
│  │ (DRM epoll)  │              │  phase_drift_ns          │     │
│  └──────────────┘              │  last_vblank_mul         │     │
│         │                      └────────────┬─────────────┘     │
│         │ flip_ns                           │                   │
│         ▼                                   ▼                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                   PllController                          │   │
│  │                                                          │   │
│  │  1. Update render budget EMA (gpu_time_ms or cpu_ms)     │   │
│  │  2. Advance vblank grid (anchor to flip_ns if fresh)     │   │
│  │  3. PI correction on phase_drift_ns                      │   │
│  │  4. deadline = next_vblank - budget - PI_correction      │   │
│  └───────────────────────────┬──────────────────────────────┘   │
│                              │ deadline_ns                      │
│                              ▼                                  │
│                    sleep_until(deadline_ns)                     │
│                    [clock_nanosleep TIMER_ABSTIME]              │
│                              │                                  │
│                              ▼                                  │
│                    get_current_texture()                        │
│                    queue.submit()                               │
│                    output.present()                             │
└─────────────────────────────────────────────────────────────────┘

The PLL runs entirely on the render thread. No background threads are involved beyond the existing FlipTracker epoll loop that delivers hardware vblank timestamps.


4. The Vblank Grid

The display refreshes on a fixed grid:

vblank[n] = anchor + n × ideal_period_ns

ideal_period_ns is derived from the monitor's reported refresh rate via the --connector DRM query (preferred) or winit's refresh_rate_millihertz() (fallback):

ideal_period_ns = (1000.0 / refresh_hz) × 1_000_000  [nanoseconds]

Examples:

| Refresh rate | ideal_period_ns |
|---|---|
| 60 Hz | 16 666 667 ns |
| 120 Hz | 8 333 333 ns |
| 165 Hz | 6 060 606 ns |
| 240 Hz | 4 166 667 ns |

PllController maintains next_vblank_ns - the absolute CLOCK_MONOTONIC timestamp of the vblank the current frame is targeting. Each frame this is advanced by ideal_period_ns × vblank_mul, keeping the grid in sync with hardware even when frames are dropped.
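
The grid arithmetic can be sketched as two pure functions (the function names here are illustrative, not taken from src/pll.rs):

```rust
/// Ideal vblank period in nanoseconds for a given refresh rate.
fn ideal_period_ns(refresh_hz: f64) -> u64 {
    ((1000.0 / refresh_hz) * 1_000_000.0).round() as u64
}

/// First grid point strictly after `now_ns`, given a hardware anchor.
fn next_vblank_ns(anchor_ns: u64, period_ns: u64, now_ns: u64) -> u64 {
    let elapsed = now_ns.saturating_sub(anchor_ns);
    let periods = elapsed / period_ns;
    anchor_ns + (periods + 1) * period_ns
}

fn main() {
    assert_eq!(ideal_period_ns(60.0), 16_666_667);
    assert_eq!(ideal_period_ns(165.0), 6_060_606);
    // Anchor at t = 0 on a 120 Hz grid: at t = 10 ms the target is grid point 2.
    assert_eq!(next_vblank_ns(0, 8_333_333, 10_000_000), 16_666_666);
    println!("ok");
}
```

Because the grid is derived from the anchor by integer multiplication rather than by repeatedly adding a rounded delta, it cannot accumulate rounding drift over a long session.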


5. Hardware Phase Anchoring

This is the most critical correctness property.

The problem with a floating origin

The naive approach anchors the phase grid to the WSI presentation timestamp of the first frame:

origin = ts_ns[first_frame]
ideal[n] = origin + n × ideal_period_ns
drift[n] = ts_ns[n] - ideal[n]

If ts_ns[first_frame] lands 4 ms after a hardware vblank edge, then ideal[n] has that 4 ms offset baked in for the entire session. A frame that presents perfectly on a hardware vblank still shows drift = 4 ms and sync_score ≈ 50. The PLL then corrects toward the wrong target - it converges the presentation clock to the floating grid, not to the hardware clock.

The fix - DRM_EVENT_FLIP_COMPLETE

FlipTracker delivers flip_ns: the CLOCK_MONOTONIC nanosecond timestamp of the moment the kernel drove the frame to the display (the actual hardware vblank edge). This is the ground truth.

Both PacingAnalyzer and PllController accept this timestamp and use it to anchor their grids on the first delivery:

In PacingAnalyzer::push():

// Store the first flip timestamp as the hardware anchor.
if self.hw_vblank_anchor_ns.is_none() {
    self.hw_vblank_anchor_ns = Some(flip_ns);
    self.phase_origin_ns = None; // force recompute from hardware anchor
}

// Phase origin = hardware vblank immediately before ts_ns.
let origin = match self.hw_vblank_anchor_ns {
    Some(anchor) => {
        let elapsed = ts_ns.saturating_sub(anchor);
        let vblank_count = elapsed / self.ideal_period_ns;
        anchor + vblank_count * self.ideal_period_ns
    }
    None => ts_ns, // floating fallback
};

In PllController::compute_deadline():

// Seed next_vblank_ns from the hardware clock, not from now % period.
None => {
    if let Some(flip_ns) = hw_vblank_ns {
        let periods_elapsed = now_ns.saturating_sub(flip_ns) / self.ideal_period_ns;
        flip_ns + (periods_elapsed + 1) * self.ideal_period_ns
    } else {
        // Fallback: snap to the next multiple of the period from now.
        let period = self.ideal_period_ns;
        now_ns + period - (now_ns % period)
    }
}

After anchoring, drift = 0 means the WSI presentation timestamp landed exactly on a hardware vblank edge. sync_score = 100 becomes an achievable target rather than a theoretical maximum.


6. Render Budget Estimation

The deadline must be set early enough for the GPU to finish rendering before the vblank edge. The budget is estimated dynamically:

measured_ns  = gpu_time_ms × 1_000_000        [from TIMESTAMP_QUERY, preferred]
             | cpu_frame_ms × 1_000_000        [fallback when GPU timestamps absent]

with_margin  = measured_ns × (1 + 0.20)       [20% safety margin]

budget_ema   = 0.15 × with_margin
             + 0.85 × budget_ema_prev          [EMA, α = 0.15]

render_budget_ns = clamp(budget_ema, period/10, period×8/10)

The EMA smoothing factor α = 0.15 gives a half-life of approximately 4 frames, tracking gradual load changes (more cubes, shader complexity) without overreacting to isolated GPU spikes from thermal throttle or TTM eviction.

The clamp to [period × 10%, period × 80%] prevents two failure modes:

  • Below 10% - the deadline is so close to the vblank edge that clock_nanosleep wakeup jitter (~50–100 µs) would cause a miss on almost every frame.
  • Above 80% - the deadline is so early that the submit happens while the previous buffer is still being composited, introducing unnecessary buffer-hold latency.

The budget starts at 70% of ideal_period_ns on construction, a conservative default that prevents the first few frames from missing their vblanks before the EMA has seen real GPU timing data.
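
The whole update can be sketched as one pure function (update_budget is an illustrative name; the real code folds this into PllController state):

```rust
const BUDGET_SAFETY_MARGIN: f64 = 0.20;
const BUDGET_EMA_ALPHA: f64 = 0.15;

/// One budget update: add the 20% margin, fold the sample into the EMA,
/// then clamp to [10%, 80%] of the vblank period.
fn update_budget(budget_ema_ns: u64, measured_ns: u64, period_ns: u64) -> u64 {
    let with_margin = measured_ns as f64 * (1.0 + BUDGET_SAFETY_MARGIN);
    let ema = BUDGET_EMA_ALPHA * with_margin
        + (1.0 - BUDGET_EMA_ALPHA) * budget_ema_ns as f64;
    (ema as u64).clamp(period_ns / 10, period_ns * 8 / 10)
}

fn main() {
    let period = 8_333_333u64; // 120 Hz
    let mut budget = period * 7 / 10; // conservative 70% seed
    // Feed a steady 3 ms GPU time; the EMA converges toward 3.6 ms (3 ms + 20%).
    for _ in 0..100 {
        budget = update_budget(budget, 3_000_000, period);
    }
    assert!(budget >= 3_599_000 && budget <= 3_600_100);
    println!("budget = {budget} ns");
}
```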


7. Deadline Computation

Each frame the full computation is:

deadline[n] = next_vblank[n] − render_budget_ns − PI_correction

In code:

let deadline_ns = if raw_correction >= 0 {
    next_vblank
        .saturating_sub(self.render_budget_ns)
        .saturating_sub(raw_correction as u64)
} else {
    next_vblank
        .saturating_sub(self.render_budget_ns)
        .saturating_add(raw_correction.unsigned_abs())
};

The sign convention:

  • Positive PI correction (frame was late, positive drift) → subtract from deadline → submit earlier next frame → presentation timestamp moves earlier toward the vblank edge.
  • Negative PI correction (frame was early, negative drift) → add to deadline → submit slightly later → prevents overcorrecting past zero.

If deadline_ns <= now_ns when compute_deadline is called, the frame is already running late. The sleep is skipped and the submit happens immediately. PacingAnalyzer will record vblank_mul > 1 for this frame and advance the grid accordingly on the next call.


8. The PI Controller

The controller uses a discrete-time Proportional-Integral structure:

e[n]        = phase_drift_ns[n-1]          (previous frame's signed drift)
integrator  = clamp(integrator + e[n], −clamp_ns, +clamp_ns)

P_term      = Kp × e[n]
I_term      = Ki × integrator
correction  = P_term + I_term
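
A minimal sketch of this update step, assuming the Acquiring gains and the anti-windup clamp described below (the struct and method names are illustrative):

```rust
const KP: f64 = 0.5;
const KI: f64 = 0.02;
const INTEGRATOR_CLAMP_NS: i64 = 16_666_667; // one 60 Hz period

struct Pi {
    integrator_ns: i64,
}

impl Pi {
    /// One discrete PI step on the previous frame's signed drift.
    fn correction(&mut self, drift_ns: i64) -> i64 {
        self.integrator_ns = (self.integrator_ns + drift_ns)
            .clamp(-INTEGRATOR_CLAMP_NS, INTEGRATOR_CLAMP_NS);
        (KP * drift_ns as f64 + KI * self.integrator_ns as f64) as i64
    }
}

fn main() {
    let mut pi = Pi { integrator_ns: 0 };
    // 1 ms late: P contributes 0.5 ms, I contributes 0.02 ms on the first step.
    assert_eq!(pi.correction(1_000_000), 520_000);
    // A persistent offset keeps growing the I term until it is integrated away.
    assert!(pi.correction(1_000_000) > 520_000);
    println!("ok");
}
```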

Why PI and not PD or PID

  • P alone - leaves a steady-state offset proportional to the fixed GPU latency. The GPU always takes some fixed time (driver overhead, PCIe DMA) that is not a round multiple of ideal_period_ns, and a proportional term cannot drive that offset to zero without infinite gain.
  • I term - integrates the offset away over time. At Ki = 0.02 and 120 Hz, a 1 ms fixed latency is eliminated in approximately 50 frames (~400 ms).
  • D term - amplifies measurement noise. phase_drift_ns carries kernel timer jitter of ±50–100 µs; a derivative would see large instantaneous rates of change from that jitter and inject corrections larger than the drift itself. Dropped.

Anti-windup

The integrator is clamped to ±INTEGRATOR_CLAMP_NS (one 60 Hz vblank period, 16 666 667 ns) before computing I_term:

self.integrator_ns = (self.integrator_ns + e)
    .clamp(-INTEGRATOR_CLAMP_NS, INTEGRATOR_CLAMP_NS);

Without this, a burst of dropped frames (thermal throttle, VT switch recovery) winds the integrator to hundreds of milliseconds. After recovery the controller would apply a massive correction, sleeping for nearly an entire second before the first frame, causing a visible freeze. The clamp limits the integrator's maximum contribution to one frame of correction regardless of how many frames are missed.


9. Lock State Machine

                  |drift| < 0.5 ms
                  for 8 consecutive frames
  Acquiring ──────────────────────────────► Locked
      ▲                                       │
      │    |drift| >= 0.5 ms                  │
      └───────────────────────────────────────┘
            (integrator reset on transition)

| State | Kp | Ki | Description |
|---|---|---|---|
| Acquiring | 0.5 | 0.02 | Full gains, actively correcting |
| Locked | 0.25 | 0.01 | Half gains, stable tracking |

Halving gains in Locked mode prevents the controller from injecting unnecessary jitter into an already-stable loop. The threshold of 0.5 ms (CONVERGENCE_WINDOW_NS = 500_000 ns) was chosen to be safely above kernel timer jitter (~100 µs) but narrow enough that the controller does not enter tracking mode prematurely on a loop that is still drifting slowly.

When the phase escapes the convergence window while in Locked state, the integrator is immediately reset and the state returns to Acquiring. Without this reset, the accumulated integral from the stable period would fight the correction needed to re-acquire, causing the loop to oscillate around the new equilibrium for many frames before settling.
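
The transitions above can be sketched as follows (the type and field names are illustrative, not taken from src/pll.rs):

```rust
const CONVERGENCE_WINDOW_NS: i64 = 500_000; // 0.5 ms
const LOCK_THRESHOLD_FRAMES: u32 = 8;

#[derive(Debug, PartialEq)]
enum LockState {
    Acquiring,
    Locked,
}

struct Lock {
    state: LockState,
    lock_count: u32,
    integrator_ns: i64,
}

impl Lock {
    /// Advance the state machine on one frame's signed drift.
    fn step(&mut self, drift_ns: i64) {
        if drift_ns.abs() < CONVERGENCE_WINDOW_NS {
            self.lock_count += 1;
            if self.lock_count >= LOCK_THRESHOLD_FRAMES {
                self.state = LockState::Locked;
            }
        } else {
            // Escape: reset the integrator so the stale integral does not
            // fight re-acquisition, then return to full gains.
            self.lock_count = 0;
            self.integrator_ns = 0;
            self.state = LockState::Acquiring;
        }
    }
}

fn main() {
    let mut l = Lock { state: LockState::Acquiring, lock_count: 0, integrator_ns: 0 };
    for _ in 0..8 {
        l.step(100_000); // eight consecutive on-time frames
    }
    assert_eq!(l.state, LockState::Locked);
    l.step(2_000_000); // one large drift breaks the lock
    assert_eq!(l.state, LockState::Acquiring);
    assert_eq!(l.integrator_ns, 0);
    println!("ok");
}
```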


10. Sleeping - clock_nanosleep(TIMER_ABSTIME)

libc::clock_nanosleep(
    libc::CLOCK_MONOTONIC,
    libc::TIMER_ABSTIME,
    &target,      // absolute CLOCK_MONOTONIC nanoseconds
    null_mut(),   // no remainder needed for ABSTIME
)

Three properties make this correct:

1. Absolute time, not relative duration.
TIMER_ABSTIME specifies a wakeup instant rather than a sleep duration. If the call itself takes 10 µs to enter the kernel (syscall overhead, context switch), that 10 µs is not added to the sleep; the wakeup still targets the same absolute instant. A relative nanosleep(duration) would add that overhead to every frame, accumulating across the session.

2. Same epoch as WSI timestamps.
Both clock_nanosleep(CLOCK_MONOTONIC) and adapter.get_presentation_timestamp() (on Vulkan/DRM backends) are in CLOCK_MONOTONIC nanoseconds. The deadline computed from next_vblank_ns (derived from flip_ns, which is also CLOCK_MONOTONIC) can be compared directly to now_ns without any epoch conversion.

3. EINTR handling.
If a Unix signal is delivered during the sleep, clock_nanosleep returns EINTR before the target time. This is non-fatal: the controller observes a phase error on the next frame and self-corrects. No retry loop is needed.

Minimum sleep threshold:
Sleeps shorter than MIN_SLEEP_NS = 100_000 ns (100 µs) are skipped entirely. Even with high-resolution timers (CONFIG_HIGH_RES_TIMERS), wakeup jitter on a desktop kernel is typically 50–100 µs. A 10 µs sleep would overshoot by 5–10×, injecting more jitter than it removes. Sub-threshold corrections still accumulate in the integrator and are applied on the next frame.
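
The skip decision is a pure function of the remaining time, which can be sketched as (sleep_target is an illustrative name; when it returns a target, the real loop issues the clock_nanosleep call shown above):

```rust
const MIN_SLEEP_NS: u64 = 100_000; // 100 µs jitter floor

/// Returns the absolute wakeup target, or None when the deadline has already
/// passed or the remaining time is below the jitter floor. A skipped sleep is
/// not lost: the residual shows up as phase drift and feeds the integrator.
fn sleep_target(now_ns: u64, deadline_ns: u64) -> Option<u64> {
    let remaining = deadline_ns.saturating_sub(now_ns);
    if remaining < MIN_SLEEP_NS {
        None
    } else {
        Some(deadline_ns)
    }
}

fn main() {
    assert_eq!(sleep_target(1_000_000, 5_000_000), Some(5_000_000)); // 4 ms left: sleep
    assert_eq!(sleep_target(4_950_000, 5_000_000), None); // 50 µs left: skip
    assert_eq!(sleep_target(6_000_000, 5_000_000), None); // already late: skip
    println!("ok");
}
```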


11. Dropped Frame Handling

When the GPU overruns its budget and misses a vblank, PacingAnalyzer reports vblank_mul > 1. The render loop reacts in two ways:

1. Integrator reset:

if self.any_vblank_miss_this_frame {
    ctrl.reset(); // clears next_vblank_ns, integrator, lock_count
}

The integrator may contain a large accumulated correction from the overrun frames. Resetting prevents that wound-up value from producing an overcorrection (a very early submission) on the recovery frame.

2. Grid advance by vblank_mul:

Some(prev) => prev + self.ideal_period_ns * mul,
// mul = last_vblank_mul.max(1) as u64

If the previous frame consumed 2 vblank periods, the grid advances by 2 × ideal_period_ns. Without this, the grid would fall one period behind the hardware after each drop, and the controller would target a vblank that has already occurred, resulting in a permanently late phase.


12. Sync Score and Phase Drift

sync_score is computed in PacingAnalyzer::push():

let drift_ns = ts_ns as i64 - ideal_ts_ns as i64;  // signed
let drift_ms = drift_ns as f32 / 1_000_000.0;
let half_period_ms = ideal_ms / 2.0;
let sync_score = (100.0 * (1.0 - drift_ms.abs() / half_period_ms))
    .clamp(0.0, 100.0);

Interpretation:

| sync_score | phase_drift_ms at 120 Hz | Meaning |
|---|---|---|
| 100 | 0 ms | Frame landed exactly on a hardware vblank edge |
| 95 | ±0.2 ms | Perceptually indistinguishable from perfect |
| 80 | ±0.8 ms | Slight phase offset; compositor holds ≈1 frame extra |
| 50 | ±2.1 ms | Quarter-period offset - the average for a randomly phased grid |
| 0 | ±4.2 ms | Half-period offset - the maximum possible drift |

Before hardware anchoring was implemented, a sync_score of ~48 was normal because the grid was anchored to the first WSI frame rather than a hardware vblank edge. The grid's zero point was arbitrary, so even perfectly consistent frames scored near 50. After anchoring, 90+ is achievable on a well-configured system.


13. Tuning Constants Reference

| Constant | Value | Description |
|---|---|---|
| KP | 0.5 | Proportional gain. 1 ms drift → 0.5 ms deadline shift. Converges in ~4 frames. |
| KI | 0.02 | Integral gain. Eliminates a fixed latency offset in ~50 frames at 120 Hz. |
| INTEGRATOR_CLAMP_NS | 16_666_667 ns | Anti-windup clamp (one 60 Hz period). Limits the maximum integral contribution to one frame of correction. |
| BUDGET_SAFETY_MARGIN | 0.20 | 20% headroom added on top of the GPU EMA for the driver flip pipeline and DMA. |
| BUDGET_EMA_ALPHA | 0.15 | EMA smoothing for the render budget. Half-life ≈ 4 frames. |
| MIN_SLEEP_NS | 100_000 ns | Minimum sleep threshold (100 µs). Below this, kernel timer jitter exceeds the correction value. |
| CONVERGENCE_WINDOW_NS | 500_000 ns | Lock detection window (0.5 ms). Frames within this threshold count toward lock. |
| LOCK_THRESHOLD_FRAMES | 8 | Consecutive on-time frames required to enter Locked tracking mode. |
| Initial render budget | 70% of period | Conservative seed before the EMA has real GPU data. |
| Budget clamp min | 10% of period | Floor to keep the deadline clear of the vblank edge. |
| Budget clamp max | 80% of period | Ceiling to prevent a deadline so early that the compositor holds the buffer. |
| Locked-mode gain | 0.5 × {KP, KI} | Half gains in tracking mode to reduce jitter injection into a stable loop. |

14. Data Flow Per Frame

render() top of frame
│
├─ gpu_timer.poll()                      [read GPU timestamps from previous frame]
│
├─ flip_tracker.queue.pop_front()        [drain DRM_EVENT_FLIP_COMPLETE]
│   └─ flip_record → {flip_ns, sequence}
│       ├─ hw_vblank_ns = flip_ns        [for grid anchoring]
│       └─ flip_latency_ms              [for frame log]
│
├─ PllController::compute_deadline(
│       phase_drift_ns  ← PacingAnalyzer::last_phase_drift_ns()
│       last_vblank_mul ← PacingAnalyzer::last_vblank_mul()
│       hw_vblank_ns    ← flip_ns
│       gpu_time_ms     ← gpu_timer.last_gpu_time_ms()
│       cpu_frame_ms    ← self.prev_cpu_frame_ms
│   )
│   ├─ update render_budget EMA
│   ├─ advance next_vblank_ns (anchor to flip_ns on first call)
│   ├─ PI correction on phase_drift_ns
│   └─ deadline = next_vblank - budget - PI_correction
│
├─ sleep_until(deadline_ns)              [clock_nanosleep CLOCK_MONOTONIC TIMER_ABSTIME]
│
├─ get_current_texture()                 [submit point, rate-limited by sleep above]
├─ encode render pass
├─ queue.submit()
├─ output.present()
│
├─ prev_cpu_frame_ms = cpu_frame_ms     [store for next frame's budget]
│
└─ PacingAnalyzer::push(
       ts_ns           ← adapter.get_presentation_timestamp()
       hw_vblank_ns    ← flip_ns
       cpu_frame_ms
       cpu_submit_ns
       gpu_time_ms
       flip_latency_ms
   )
   ├─ anchor phase grid to hw_vblank_ns on first delivery
   ├─ compute drift_ns = ts_ns - ideal_ts_ns (signed)
   ├─ sync_score = 100 × (1 - |drift_ms| / half_period_ms)
   └─ store last_phase_drift_ns, last_vblank_mul  [for next frame's PLL input]

15. Present Mode Compatibility

--mode mailbox or --mode immediate (recommended)

The driver does not block inside get_current_texture(). The pre-submit sleep has direct control over the submit instant. The PLL achieves full rate-limiting and phase alignment. Use these modes with --pll for best results.

--mode fifo

The driver blocks inside get_current_texture() until the next vblank, consuming most or all of the sleep budget. The application cannot submit faster than one-per-vblank regardless (the driver enforces this), so the FPS cap is still effective. However, the precise phase alignment is less controllable because the wakeup point is determined by the driver, not by the PLL. A startup warning is printed when --pll --mode fifo is detected.


16. Frame Log Fields

When --pll and --frame-log are both active, each NDJSON row gains five additional fields:

| Field | Type | Description |
|---|---|---|
| pll_error_ns | i64 | Phase error fed into this frame's PI iteration. Positive = late. |
| pll_sleep_ns | u64 | Sleep duration actually issued. 0 when the deadline has passed or the correction is < 100 µs. |
| pll_deadline_ns | u64 | Absolute CLOCK_MONOTONIC target the PLL slept until. Useful for verifying grid alignment against ts_ns. |
| pll_budget_ns | u64 | Render budget EMA at this frame. Divide by 1 000 000 for milliseconds. |
| pll_lock | 0 or 1 | 1 when the controller is in Locked (tracking) mode. |

Example row (120 Hz, PLL locked):

{
  "schema": 5,
  "frame": 1240,
  "ts_ns": 548293810000,
  "delta_ms": 8.3330,
  "ideal_ms": 8.3333,
  "drift_ms": 0.0821,
  "sync": 98.04,
  "pll_error_ns": 82100,
  "pll_sleep_ns": 5100000,
  "pll_deadline_ns": 548285700000,
  "pll_budget_ns": 3200000,
  "pll_lock": 1
}
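
As a sanity check, the sync field in the row above can be reproduced from drift_ms and ideal_ms with the formula from the sync-score section (a standalone sketch, not the source itself):

```rust
/// sync_score from signed drift, as computed in PacingAnalyzer::push().
fn sync_score(drift_ms: f32, ideal_ms: f32) -> f32 {
    let half_period_ms = ideal_ms / 2.0;
    (100.0 * (1.0 - drift_ms.abs() / half_period_ms)).clamp(0.0, 100.0)
}

fn main() {
    // drift_ms = 0.0821 at ideal_ms = 8.3333 gives sync ≈ 98.03, matching the
    // example row within rounding.
    let sync = sync_score(0.0821, 8.3333);
    assert!((sync - 98.03).abs() < 0.05);
    println!("sync = {sync:.2}");
}
```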

17. Limitations

Wayland compositor scheduling policy.
Even with a perfectly-timed submit, the compositor may hold a finished buffer across a vblank (Mutter's max_render_time, KWin's buffer-age logic). The PLL corrects for constant compositor latency via the integral term, but cannot compensate for variable hold times that change frame-to-frame. flip_latency_ms in the frame log exposes when this is the bottleneck rather than the GPU or PLL.

No direct scanout / KMS bypass.
True "racing the beam" (writing pixels as the electron beam scans) is impossible in userspace on a compositor-mediated display stack. The compositor owns the KMS fd exclusively. The PLL achieves the closest approximation available to unprivileged processes: presenting the buffer to the compositor at the optimal moment so it is available at the next vblank without excess queuing.

FlipTracker dependency for full accuracy.
The hardware phase anchor requires DRM_EVENT_FLIP_COMPLETE timestamps from FlipTracker. Without DRM access, the grid falls back to a floating WSI origin and sync_score measures consistency rather than absolute phase. The PLL still rate-limits correctly in this mode but may not converge to the hardware vblank edge.

Kernel timer granularity.
clock_nanosleep has an effective wakeup jitter of approximately 50–100 µs on a typical desktop kernel, even with high-resolution timers enabled. The minimum sleep threshold of 100 µs exists for this reason. At very high refresh rates (240 Hz, period ≈ 4.2 ms), this jitter represents ~2.4% of the frame budget, setting a floor on achievable phase_drift_ms.

One-frame-lag feedback.
The phase error fed into the PI controller is always from the previous frame's presentation timestamp; the current frame's timestamp is not available until after present() returns, which is after the sleep that needed the correction. This one-frame lag is intrinsic to the control loop structure and is handled by the integrator term, which accumulates the offset across frames.
