PLL
Relevant source file: src/pll.rs
Enabled by: --pll CLI flag
Best used with: --mode mailbox or --mode immediate
- Goal
- Background - Why Naive Rendering Fails
- Architecture Overview
- The Vblank Grid
- Hardware Phase Anchoring
- Render Budget Estimation
- Deadline Computation
- The PI Controller
- Lock State Machine
- Sleeping: clock_nanosleep(TIMER_ABSTIME)
- Dropped Frame Handling
- Sync Score and Phase Drift
- Tuning Constants Reference
- Data Flow Per Frame
- Present Mode Compatibility
- Frame Log Fields
- Limitations
Lock the render loop to exactly one frame per monitor vblank period and attempt to deliver each frame just-in-time (JIT). The architecture uses a predictive PI controller to shift the submission phase as close to the vblank edge as the system allows.
Most Linux compositors are hardcoded to target a phase offset of ~50% of the vblank period. This is a safety guardband: if you deliver a frame "too late" (e.g., at 90% phase), the compositor may reject it for the current cycle and delay it by an entire frame.
In these environments, the PLL cannot "force" a sync_score of 100. Instead, it serves to:
- Identify the Guardband: Find the exact point where the compositor begins to drop or delay frames.
- Expose the Latency Tax: Measure the difference between the hardware vblank and the earliest phase the compositor is willing to accept.
- Maximise Stability: Lock to the highest stable phase allowed, eliminating the random jitter of uncapped rendering.
| Metric | Without --pll | With --pll (Locked) |
|---|---|---|
| FPS | Uncapped (GPU-limited) | Locked to refresh rate |
| sync_score | ~48 (random/default) | Max stable allowed by compositor |
| phase_drift_ms | ± half-period | Minimal/deterministic |
| flip_latency_ms | High/unpredictable | Stable (reveals compositor overhead) |
A display scans out one frame per vblank period T. Without rate limiting, a GPU that renders in 3 ms can produce nearly three frames per 8.3 ms period at 120 Hz; the compositor queues them and presents one per vblank regardless, so the FPS seen by the application does not match what the user actually sees.
Even with vsync (PresentMode::Fifo), the phase of submission is uncontrolled. The frame may be submitted with 7 ms to spare before the vblank or with 0.1 ms; the compositor holds it in a buffer and presents it at the next vblank either way. This means sync_score is essentially random: it measures where the finished frame happened to arrive relative to the hardware clock, and without deliberate phase control that lands roughly a quarter period off on average - hence sync_score ≈ 50.
The PLL fixes both problems: it limits rate and controls phase.
┌─────────────────────────────────────────────────────────────────┐
│ render() loop │
│ │
│ ┌──────────────┐ flip_ns ┌──────────────────────────┐ │
│ │ FlipTracker │ ──────────► │ PacingAnalyzer │ │
│ │ (DRM epoll) │ │ phase_drift_ns │ │
│ └──────────────┘ │ last_vblank_mul │ │
│ │ └────────────┬─────────────┘ │
│ │ flip_ns │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ PllController │ │
│ │ │ │
│ │ 1. Update render budget EMA (gpu_time_ms or cpu_ms) │ │
│ │ 2. Advance vblank grid (anchor to flip_ns if fresh) │ │
│ │ 3. PI correction on phase_drift_ns │ │
│ │ 4. deadline = next_vblank - budget - PI_correction │ │
│ └───────────────────────────┬──────────────────────────────┘ │
│ │ deadline_ns │
│ ▼ │
│ sleep_until(deadline_ns) │
│ [clock_nanosleep TIMER_ABSTIME] │
│ │ │
│ ▼ │
│ get_current_texture() │
│ queue.submit() │
│ output.present() │
└─────────────────────────────────────────────────────────────────┘
The PLL runs entirely on the render thread. No background threads are involved beyond the existing FlipTracker epoll loop that delivers hardware vblank timestamps.
The display refreshes on a fixed grid:
vblank[n] = anchor + n × ideal_period_ns
ideal_period_ns is derived from the monitor's reported refresh rate via the --connector DRM query (preferred) or winit's refresh_rate_millihertz() (fallback):
ideal_period_ns = (1000.0 / refresh_hz) × 1_000_000 [nanoseconds]
Examples:
| Refresh rate | ideal_period_ns |
|---|---|
| 60 Hz | 16 666 667 ns |
| 120 Hz | 8 333 333 ns |
| 165 Hz | 6 060 606 ns |
| 240 Hz | 4 166 667 ns |
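The conversion can be checked with a few lines. This is a sketch of the derivation above, not the actual helper in src/pll.rs:

```rust
/// Convert a refresh rate in Hz to an ideal vblank period in nanoseconds.
/// Sketch of the formula above; the real code in src/pll.rs may differ.
fn ideal_period_ns(refresh_hz: f64) -> u64 {
    // (1000.0 / refresh_hz) ms per frame, times 1_000_000 ns per ms, rounded.
    ((1000.0 / refresh_hz) * 1_000_000.0).round() as u64
}
```

Rounding (rather than truncating) reproduces the table values exactly, e.g. 16 666 667 ns at 60 Hz.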
PllController maintains next_vblank_ns - the absolute CLOCK_MONOTONIC timestamp of the vblank the current frame is targeting. Each frame this is advanced by ideal_period_ns × vblank_mul, keeping the grid in sync with hardware even when frames are dropped.
This is the most critical correctness property.
The naive approach anchors the phase grid to the WSI presentation timestamp of the first frame:
origin = ts_ns[first_frame]
ideal[n] = origin + n × ideal_period_ns
drift[n] = ts_ns[n] - ideal[n]
If ts_ns[first_frame] lands 4 ms after a hardware vblank edge, then ideal[n] has that 4 ms offset baked in for the entire session. A frame that presents perfectly on a hardware vblank still shows drift = 4 ms and sync_score ≈ 50. The PLL then corrects toward the wrong target - it converges the presentation clock to the floating grid, not to the hardware clock.
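A toy calculation makes the failure concrete. This sketch uses illustrative constants (a 120 Hz period and a WSI timestamp 4 ms late), not the PacingAnalyzer implementation:

```rust
const PERIOD_NS: u64 = 8_333_333; // 120 Hz, illustrative

/// Signed drift of `ts_ns` against the grid vblank immediately before it,
/// for a grid anchored at `anchor_ns` (mirrors the "vblank before ts" rule).
fn drift_ns(ts_ns: u64, anchor_ns: u64) -> i64 {
    let n = ts_ns.saturating_sub(anchor_ns) / PERIOD_NS;
    ts_ns as i64 - (anchor_ns + n * PERIOD_NS) as i64
}
```

With a hardware edge at 1 000 000 000 ns, a frame presenting exactly 10 periods later has zero drift against the hardware anchor, while a grid anchored to a WSI timestamp at 1 004 000 000 ns reports ~4.3 ms of drift for the same perfectly timed frame.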
FlipTracker delivers flip_ns: the CLOCK_MONOTONIC nanosecond timestamp of the moment the kernel drove the frame to the display (the actual hardware vblank edge). This is the ground truth.
Both PacingAnalyzer and PllController accept this timestamp and use it to anchor their grids on the first delivery:
In PacingAnalyzer::push():

```rust
// Store the first flip timestamp as the hardware anchor.
if self.hw_vblank_anchor_ns.is_none() {
    self.hw_vblank_anchor_ns = Some(flip_ns);
    self.phase_origin_ns = None; // force recompute from hardware anchor
}

// Phase origin = hardware vblank immediately before ts_ns.
let origin = match self.hw_vblank_anchor_ns {
    Some(anchor) => {
        let elapsed = ts_ns.saturating_sub(anchor);
        let vblank_count = elapsed / self.ideal_period_ns;
        anchor + vblank_count * self.ideal_period_ns
    }
    None => ts_ns, // floating fallback
};
```

In PllController::compute_deadline():

```rust
// Seed next_vblank_ns from the hardware clock, not from now % period.
None => {
    if let Some(flip_ns) = hw_vblank_ns {
        let periods_elapsed = now_ns.saturating_sub(flip_ns) / self.ideal_period_ns;
        flip_ns + (periods_elapsed + 1) * self.ideal_period_ns
    } else {
        // Fallback: snap to the next multiple of the period from now.
        now_ns + period - (now_ns % period)
    }
}
```

After anchoring, drift = 0 means the WSI presentation timestamp landed exactly on a hardware vblank edge. sync_score = 100 becomes an achievable target rather than a theoretical maximum.
The deadline must be set early enough for the GPU to finish rendering before the vblank edge. The budget is estimated dynamically:
measured_ns = gpu_time_ms × 1_000_000 [from TIMESTAMP_QUERY, preferred]
| cpu_frame_ms × 1_000_000 [fallback when GPU timestamps absent]
with_margin = measured_ns × (1 + 0.20) [20% safety margin]
budget_ema = 0.15 × with_margin
+ 0.85 × budget_ema_prev [EMA, α = 0.15]
render_budget_ns = clamp(budget_ema, period/10, period×8/10)
The EMA smoothing factor α = 0.15 gives a half-life of approximately 4 frames, tracking gradual load changes (more cubes, shader complexity) without overreacting to isolated GPU spikes from thermal throttle or TTM eviction.
The clamp [period/10, period × 8/10] prevents two failure modes:
- Below 10% - the deadline is so close to the vblank edge that clock_nanosleep wakeup jitter (~50–100 µs) would cause a miss on almost every frame.
- Above 80% - the deadline is so early that the submit happens while the previous buffer is still being composited, introducing unnecessary buffer-hold latency.
The budget starts at 70% of ideal_period_ns on construction, a conservative default that prevents the first few frames from missing their vblanks before the EMA has seen real GPU timing data.
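Putting the formulas together, one budget update step can be sketched as follows. Constant names follow the tuning constants table; the actual src/pll.rs implementation may differ in detail:

```rust
const BUDGET_SAFETY_MARGIN: f64 = 0.20; // 20% headroom
const BUDGET_EMA_ALPHA: f64 = 0.15;     // EMA smoothing factor

/// One EMA step of the render-budget estimate, clamped to
/// [10%, 80%] of the vblank period (sketch of the formulas above).
fn update_budget(budget_ema_ns: u64, measured_ns: u64, period_ns: u64) -> u64 {
    let with_margin = measured_ns as f64 * (1.0 + BUDGET_SAFETY_MARGIN);
    let ema = BUDGET_EMA_ALPHA * with_margin
        + (1.0 - BUDGET_EMA_ALPHA) * budget_ema_ns as f64;
    (ema as u64).clamp(period_ns / 10, period_ns * 8 / 10)
}
```

At 120 Hz with the 70% seed (5 833 333 ns) and a 3 ms GPU measurement, a single step moves the budget to ~5.50 ms; a wildly long measurement is capped at 80% of the period instead of blowing the deadline past the previous vblank.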
Each frame the full computation is:
deadline[n] = next_vblank[n] − render_budget_ns − PI_correction
In code:
```rust
let deadline_ns = if raw_correction >= 0 {
    next_vblank
        .saturating_sub(self.render_budget_ns)
        .saturating_sub(raw_correction as u64)
} else {
    next_vblank
        .saturating_sub(self.render_budget_ns)
        .saturating_add(raw_correction.unsigned_abs())
};
```

The sign convention:
- Positive PI correction (frame was late, positive drift) → subtract from deadline → submit earlier next frame → presentation timestamp moves earlier toward the vblank edge.
- Negative PI correction (frame was early, negative drift) → add to deadline → submit slightly later → prevents overcorrecting past zero.
If deadline_ns <= now_ns when compute_deadline is called, the frame is already running late. The sleep is skipped and the submit happens immediately. PacingAnalyzer will record vblank_mul > 1 for this frame and advance the grid accordingly on the next call.
The controller uses a discrete-time Proportional-Integral structure:
e[n] = phase_drift_ns[n-1] (previous frame's signed drift)
integrator = clamp(integrator + e[n], −clamp_ns, +clamp_ns)
P_term = Kp × e[n]
I_term = Ki × integrator
correction = P_term + I_term
- P alone - leaves a steady-state offset equal to fixed_gpu_latency / Kp. The GPU always takes some fixed time (driver overhead, PCIe DMA) that is not a round multiple of ideal_period_ns; P cannot eliminate this offset without infinite gain.
- I term - integrates the offset away over time. At Ki = 0.02 and 120 Hz, a 1 ms fixed latency is eliminated in approximately 50 frames (~400 ms).
- D term - amplifies measurement noise. phase_drift_ns has kernel timer jitter of ±50–100 µs; a derivative would see large instantaneous rate-of-change from jitter and inject corrections larger than the drift itself. Dropped.
The integrator is clamped to ±INTEGRATOR_CLAMP_NS (one 60 Hz vblank period, 16 666 667 ns) before computing I_term:
```rust
self.integrator_ns = (self.integrator_ns + e)
    .clamp(-INTEGRATOR_CLAMP_NS, INTEGRATOR_CLAMP_NS);
```

Without this, a burst of dropped frames (thermal throttle, VT switch recovery) winds the integrator to hundreds of milliseconds. After recovery the controller would apply a massive correction, sleeping for nearly an entire second before the first frame, causing a visible freeze. The clamp limits the integrator's maximum contribution to one frame of correction regardless of how many frames are missed.
|drift| < 0.5 ms
for 8 consecutive frames
Acquiring ──────────────────────────────► Locked
▲ │
│ |drift| >= 0.5 ms │
└───────────────────────────────────────┘
(integrator reset on transition)
| State | Kp | Ki | Description |
|---|---|---|---|
| Acquiring | 0.5 | 0.02 | Full gains, actively correcting |
| Locked | 0.25 | 0.01 | Half gains, stable tracking |
Halving gains in Locked mode prevents the controller from injecting unnecessary jitter into an already-stable loop. The threshold of 0.5 ms (CONVERGENCE_WINDOW_NS = 500_000 ns) was chosen to be safely above kernel timer jitter (~100 µs) but narrow enough that the controller does not enter tracking mode prematurely on a loop that is still drifting slowly.
When the phase escapes the convergence window while in Locked state, the integrator is immediately reset and the state returns to Acquiring. Without this reset, the accumulated integral from the stable period would fight the correction needed to re-acquire, causing the loop to oscillate around the new equilibrium for many frames before settling.
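The transition rules above can be sketched as a small state machine. This is an illustrative model of the behavior described here, not the src/pll.rs code:

```rust
const CONVERGENCE_WINDOW_NS: i64 = 500_000; // 0.5 ms
const LOCK_THRESHOLD_FRAMES: u32 = 8;

#[derive(Debug, PartialEq)]
enum LockState { Acquiring, Locked }

struct Lock {
    state: LockState,
    lock_count: u32,
    integrator_ns: i64,
}

impl Lock {
    /// Feed one frame's signed phase drift into the lock detector.
    fn observe(&mut self, drift_ns: i64) {
        if drift_ns.abs() < CONVERGENCE_WINDOW_NS {
            self.lock_count += 1;
            if self.lock_count >= LOCK_THRESHOLD_FRAMES {
                self.state = LockState::Locked;
            }
        } else {
            self.lock_count = 0;
            if self.state == LockState::Locked {
                self.integrator_ns = 0; // reset on falling out of lock
            }
            self.state = LockState::Acquiring;
        }
    }
}
```

Eight consecutive in-window frames promote the loop to Locked; a single excursion drops it back to Acquiring and zeroes the integrator so the stale integral cannot fight re-acquisition.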
```rust
libc::clock_nanosleep(
    libc::CLOCK_MONOTONIC,
    libc::TIMER_ABSTIME,
    &target,     // absolute CLOCK_MONOTONIC nanoseconds
    null_mut(),  // no remainder needed for ABSTIME
)
```

Three properties make this correct:
1. Absolute time, not relative duration.
TIMER_ABSTIME specifies a wakeup instant rather than a sleep duration. If the call itself takes 10 µs to enter the kernel (syscall overhead, context switch), that 10 µs is not added to the sleep; the wakeup still targets the same absolute instant. A relative nanosleep(duration) would add that overhead to every frame, accumulating across the session.
2. Same epoch as WSI timestamps.
Both clock_nanosleep(CLOCK_MONOTONIC) and adapter.get_presentation_timestamp() (on Vulkan/DRM backends) are in CLOCK_MONOTONIC nanoseconds. The deadline computed from next_vblank_ns (derived from flip_ns, which is also CLOCK_MONOTONIC) can be compared directly to now_ns without any epoch conversion.
3. EINTR handling.
If a Unix signal is delivered during the sleep, clock_nanosleep returns EINTR before the target time. This is non-fatal; the controller will observe a phase error on the next frame and self-correct. No retry loop is needed.
Minimum sleep threshold:
Sleeps shorter than MIN_SLEEP_NS = 100_000 ns (100 µs) are skipped entirely. Even with high-resolution timers, clock_nanosleep wakeup jitter is typically 50–100 µs; a 10 µs sleep would overshoot by 5–10×, injecting more jitter than it removes. Sub-threshold corrections still accumulate in the integrator and are applied on the next frame.
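The skip logic can be sketched in pure std Rust. This models only the threshold decision; the real sleep uses libc's clock_nanosleep with TIMER_ABSTIME, not a relative std sleep:

```rust
use std::time::Duration;

const MIN_SLEEP_NS: u64 = 100_000; // 100 µs

/// Decide whether to sleep at all before submitting (sketch).
/// Returns None for a late frame or a sub-threshold sleep, in which
/// case the submit happens immediately.
fn sleep_duration(now_ns: u64, deadline_ns: u64) -> Option<Duration> {
    let remaining = deadline_ns.saturating_sub(now_ns);
    if remaining < MIN_SLEEP_NS {
        None
    } else {
        Some(Duration::from_nanos(remaining))
    }
}
```

Both the "deadline already passed" and "sleep shorter than 100 µs" cases collapse to the same immediate-submit path; the skipped correction is not lost because it remains in the integrator.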
When the GPU overruns its budget and misses a vblank, PacingAnalyzer reports vblank_mul > 1. The render loop reacts in two ways:
1. Integrator reset:
```rust
if self.any_vblank_miss_this_frame {
    ctrl.reset(); // clears next_vblank_ns, integrator, lock_count
}
```

The integrator may contain a large accumulated correction from the overrun frames. Resetting prevents that wound-up value from producing an overcorrection (a very early submission) on the recovery frame.
2. Grid advance by vblank_mul:
```rust
// mul = last_vblank_mul.max(1) as u64
Some(prev) => prev + self.ideal_period_ns * mul,
```

If the previous frame consumed 2 vblank periods, the grid advances by 2 × ideal_period_ns. Without this, the grid would fall one period behind the hardware after each drop, and the controller would target a vblank that has already occurred, resulting in a permanently late phase.
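The advance rule can be written as a pure function (an illustrative sketch, with the mul.max(1) guard made explicit):

```rust
/// Advance the vblank grid, skipping as many periods as the last
/// frame actually consumed. A reported mul of 0 is treated as 1.
fn advance_grid(prev_vblank_ns: u64, period_ns: u64, last_vblank_mul: u32) -> u64 {
    prev_vblank_ns + period_ns * last_vblank_mul.max(1) as u64
}
```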
sync_score is computed in PacingAnalyzer::push():
```rust
let drift_ns = ts_ns as i64 - ideal_ts_ns as i64; // signed
let drift_ms = drift_ns as f32 / 1_000_000.0;
let half_period_ms = ideal_ms / 2.0;
let sync_score = (100.0 * (1.0 - drift_ms.abs() / half_period_ms))
    .clamp(0.0, 100.0);
```

Interpretation:
| sync_score | phase_drift_ms at 120 Hz | Meaning |
|---|---|---|
| 100 | 0 ms | Frame landed exactly on a hardware vblank edge |
| 95 | ±0.2 ms | Perceptually indistinguishable from perfect |
| 80 | ±0.8 ms | Slight phase offset, compositor holds ≈1 frame extra |
| 50 | ±2.1 ms | Half-period offset - worst case |
| 0 | ±4.2 ms | Maximum possible drift |
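The table rows follow directly from the formula. A standalone sketch of the score function (mirroring the snippet above) reproduces them:

```rust
/// sync_score from signed drift and ideal period, both in milliseconds.
fn sync_score(drift_ms: f32, ideal_ms: f32) -> f32 {
    let half_period_ms = ideal_ms / 2.0;
    (100.0 * (1.0 - drift_ms.abs() / half_period_ms)).clamp(0.0, 100.0)
}
```

At 120 Hz (ideal_ms ≈ 8.3333), zero drift scores exactly 100, a ±2.1 ms drift scores ≈50, and anything beyond the ±4.17 ms half-period clamps to 0.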
Before hardware anchoring was implemented, a sync_score of ~48 was normal because the grid was anchored to the first WSI frame rather than a hardware vblank edge. The grid's zero point was arbitrary, so even perfectly consistent frames scored near 50. After anchoring, 90+ is achievable on a well-configured system.
| Constant | Value | Description |
|---|---|---|
| KP | 0.5 | Proportional gain. 1 ms drift → 0.5 ms deadline shift. Converges in ~4 frames. |
| KI | 0.02 | Integral gain. Eliminates fixed latency offset in ~50 frames at 120 Hz. |
| INTEGRATOR_CLAMP_NS | 16_666_667 ns | Anti-windup clamp (one 60 Hz period). Limits max integral contribution to one frame of correction. |
| BUDGET_SAFETY_MARGIN | 0.20 | 20% headroom added on top of GPU EMA for driver flip pipeline and DMA. |
| BUDGET_EMA_ALPHA | 0.15 | EMA smoothing for render budget. Half-life ≈ 4 frames. |
| MIN_SLEEP_NS | 100_000 ns | Minimum sleep threshold (100 µs). Below this, kernel timer jitter exceeds the correction value. |
| CONVERGENCE_WINDOW_NS | 500_000 ns | Lock detection window (0.5 ms). Frames within this threshold count toward lock. |
| LOCK_THRESHOLD_FRAMES | 8 | Consecutive on-time frames required to enter Locked tracking mode. |
| Initial render budget | 70% of period | Conservative seed before EMA has real GPU data. |
| Budget clamp min | 10% of period | Floor to prevent deadline from being too close to vblank edge. |
| Budget clamp max | 80% of period | Ceiling to prevent deadline from being so early that the compositor holds the buffer. |
| Locked-mode gain | 0.5 × {KP, KI} | Half gains in tracking mode to reduce jitter injection into a stable loop. |
render() top of frame
│
├─ gpu_timer.poll() [read GPU timestamps from previous frame]
│
├─ flip_tracker.queue.pop_front() [drain DRM_EVENT_FLIP_COMPLETE]
│ └─ flip_record → {flip_ns, sequence}
│ ├─ hw_vblank_ns = flip_ns [for grid anchoring]
│ └─ flip_latency_ms [for frame log]
│
├─ PllController::compute_deadline(
│ phase_drift_ns ← PacingAnalyzer::last_phase_drift_ns()
│ last_vblank_mul ← PacingAnalyzer::last_vblank_mul()
│ hw_vblank_ns ← flip_ns
│ gpu_time_ms ← gpu_timer.last_gpu_time_ms()
│ cpu_frame_ms ← self.prev_cpu_frame_ms
│ )
│ ├─ update render_budget EMA
│ ├─ advance next_vblank_ns (anchor to flip_ns on first call)
│ ├─ PI correction on phase_drift_ns
│ └─ deadline = next_vblank - budget - PI_correction
│
├─ sleep_until(deadline_ns) [clock_nanosleep CLOCK_MONOTONIC TIMER_ABSTIME]
│
├─ get_current_texture() [submit point, rate-limited by sleep above]
├─ encode render pass
├─ queue.submit()
├─ output.present()
│
├─ prev_cpu_frame_ms = cpu_frame_ms [store for next frame's budget]
│
└─ PacingAnalyzer::push(
ts_ns ← adapter.get_presentation_timestamp()
hw_vblank_ns ← flip_ns
cpu_frame_ms
cpu_submit_ns
gpu_time_ms
flip_latency_ms
)
├─ anchor phase grid to hw_vblank_ns on first delivery
├─ compute drift_ns = ts_ns - ideal_ts_ns (signed)
├─ sync_score = 100 × (1 - |drift_ms| / half_period_ms)
└─ store last_phase_drift_ns, last_vblank_mul [for next frame's PLL input]
Mailbox / Immediate: The driver does not block inside get_current_texture(), so the pre-submit sleep has direct control over the submit instant and the PLL achieves full rate limiting and phase alignment. Use these modes with --pll for best results.
Fifo: The driver blocks inside get_current_texture() until the next vblank, consuming most or all of the sleep budget. The application cannot submit faster than one-per-vblank regardless (the driver enforces this), so the FPS cap is still effective, but precise phase alignment is less controllable because the wakeup point is determined by the driver, not by the PLL. A startup warning is printed when --pll --mode fifo is detected.
When --pll and --frame-log are both active, each NDJSON row gains five additional fields:
| Field | Type | Description |
|---|---|---|
| pll_error_ns | i64 | Phase error fed into this frame's PI iteration. Positive = late. |
| pll_sleep_ns | u64 | Sleep duration actually issued. 0 when the deadline has passed or the correction is < 100 µs. |
| pll_deadline_ns | u64 | Absolute CLOCK_MONOTONIC target the PLL slept until. Useful for verifying grid alignment against ts_ns. |
| pll_budget_ns | u64 | Render budget EMA at this frame. Divide by 1_000_000 for milliseconds. |
| pll_lock | 0 or 1 | 1 when the controller is in Locked (tracking) mode. |
Example row (120 Hz, PLL locked):
```json
{
  "schema": 5,
  "frame": 1240,
  "ts_ns": 548293810000,
  "delta_ms": 8.3330,
  "ideal_ms": 8.3333,
  "drift_ms": 0.0821,
  "sync": 98.04,
  "pll_error_ns": 82100,
  "pll_sleep_ns": 5100000,
  "pll_deadline_ns": 548285700000,
  "pll_budget_ns": 3200000,
  "pll_lock": 1
}
```

Wayland compositor scheduling policy.
Even with a perfectly-timed submit, the compositor may hold a finished buffer across a vblank (Mutter's max_render_time, KWin's buffer-age logic). The PLL corrects for constant compositor latency via the integral term, but cannot compensate for variable hold times that change frame-to-frame. flip_latency_ms in the frame log exposes when this is the bottleneck rather than the GPU or PLL.
No direct scanout / KMS bypass.
True "racing the beam" (writing pixels as the electron beam scans) is impossible in userspace on a compositor-mediated display stack. The compositor owns the KMS fd exclusively. The PLL achieves the closest approximation available to unprivileged processes: presenting the buffer to the compositor at the optimal moment so it is available at the next vblank without excess queuing.
FlipTracker dependency for full accuracy.
The hardware phase anchor requires DRM_EVENT_FLIP_COMPLETE timestamps from FlipTracker. Without DRM access, the grid falls back to a floating WSI origin and sync_score measures consistency rather than absolute phase. The PLL still rate-limits correctly in this mode but may not converge to the hardware vblank edge.
Kernel timer granularity.
clock_nanosleep has an effective wakeup jitter of approximately 50–100 µs on a typical desktop kernel, even with high-resolution timers. The minimum sleep threshold of 100 µs exists for this reason. At very high refresh rates (240 Hz, period ≈ 4.2 ms), this jitter represents ~2.4% of the frame budget, setting a floor on achievable phase_drift_ms.
One-frame-lag feedback.
The phase error fed into the PI controller is always from the previous frame's presentation timestamp; the current frame's timestamp is not available until after present() returns, which is after the sleep that needs the correction. This one-frame lag is intrinsic to the control loop structure and is handled by the integrator term, which accumulates the offset across frames.