EVM: optimize OpSwap to use direct list manipulation #1044

Merged
blishko merged 2 commits into main from perf-swap
Mar 9, 2026

@elopez (Collaborator) commented Mar 7, 2026

Description

Replace optics-based ix/zoom swap with direct splitAt-based list swap. The old code used two ix traversals + zoom + two .= assignments. The new code uses a single splitAt + list construction.
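
To make the idea concrete outside the VM monad, here is a minimal pure-list sketch of a splitAt-based swap. The name `swapTop` and the `Maybe` result are illustrative assumptions, not part of hevm; the real handler reports failure via `underrun` and writes the stack with `assign'`. The sketch assumes `n >= 1`, as for the EVM SWAP1..SWAP16 opcodes.

```haskell
-- Swap the head of the stack with the element at index n (n >= 1).
-- Hypothetical helper for illustration only; Nothing models a stack underrun.
swapTop :: Int -> [a] -> Maybe [a]
swapTop n stk =
  case stk of
    e0 : rest ->
      -- One traversal: the n - 1 elements between the two swapped slots.
      case splitAt (n - 1) rest of
        (middle, ei : after) -> Just (ei : middle ++ (e0 : after))
        _                    -> Nothing  -- fewer than n + 1 elements
    [] -> Nothing
```

For example, `swapTop 1 [10, 20, 30]` evaluates to `Just [20, 10, 30]`, and `swapTop 3 [10, 20, 30]` evaluates to `Nothing` because the stack is too shallow.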

bench-perf shows a ∼10% reduction on the primes and loop benchmarks. The optimization was suggested by Claude Code.

Checklist

  • tested locally
  • added automated tests
  • updated the docs
  • updated the changelog

@blishko (Collaborator) commented Mar 9, 2026

Have you considered the more compact version?

OpSwap i -> {-# SCC "OpSwap" #-}
  let idx = into i in
  case splitAt idx stk of
    (e0:middle, ei:after) ->
      burn g_verylow $ do
        next
        assign' (#state % #stack) $ ei : middle ++ (e0 : after)
    _ -> underrun

I am getting inconclusive results compared to your version.
The numbers are slightly worse for some benchmarks and slightly better for others.

@elopez (Collaborator, Author) commented Mar 9, 2026

I hadn't tried that one, but I just benchmarked it against the one in the PR, and indeed the results are inconclusive between the two. I'm OK with either of them; the resulting code seems to be quite similar, see below.

Claude's interpretation from -ddump-simpl

OpSwap GHC Core Comparison: 3 Variants

Source Code

Variant 0 — Old (Optics: ix, zoom, .=)

OpSwap i -> {-# SCC "OpSwap" #-}
  case (stk ^? ix_i, stk ^? ix_0) of
    (Just ei, Just e0) ->
      burn g_verylow $ do
        next
        zoom (#state % #stack) $ do
          ix_i .= e0
          ix_0 .= ei
    _ -> underrun
  where
    ix_i = ix (into i)
    ix_0 = ix 0

Variant 1 — Current (splitAt (idx-1) rest)

OpSwap i -> {-# SCC "OpSwap" #-}
  let idx = into i in
  case stk of
    e0:rest -> case splitAt (idx - 1) rest of
      (middle, ei:after) ->
        burn g_verylow $ do
          next
          assign' (#state % #stack) $ ei : middle ++ (e0 : after)
      _ -> underrun
    _ -> underrun

Variant 2 — Proposed (splitAt idx stk)

OpSwap i -> {-# SCC "OpSwap" #-}
  let idx = into i in
  case splitAt idx stk of
    (e0:middle, ei:after) ->
      burn g_verylow $ do
        next
        assign' (#state % #stack) $ ei : middle ++ (e0 : after)
    _ -> underrun
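
As a sanity check that Variants 1 and 2 compute the same stack, both can be modelled as pure functions on lists and compared exhaustively over small inputs. This is a sketch under assumptions: `swapV1`, `swapV2`, and `agreeOnValidIndices` are hypothetical names, and `Nothing` stands in for `underrun`.

```haskell
-- Pure-list models of the two splitAt variants (hypothetical names).
swapV1, swapV2 :: Int -> [a] -> Maybe [a]

-- Variant 1: peel off the head, then split the remainder at idx - 1.
swapV1 idx stk = case stk of
  e0 : rest -> case splitAt (idx - 1) rest of
    (middle, ei : after) -> Just (ei : middle ++ (e0 : after))
    _                    -> Nothing
  [] -> Nothing

-- Variant 2: split the whole stack at idx and match on both halves.
swapV2 idx stk = case splitAt idx stk of
  (e0 : middle, ei : after) -> Just (ei : middle ++ (e0 : after))
  _                         -> Nothing

-- True when both variants agree for every SWAP index 1..16
-- on stacks of depth 0..20.
agreeOnValidIndices :: Bool
agreeOnValidIndices =
  and [ swapV1 i stk == swapV2 i stk
      | i <- [1 .. 16], depth <- [0 .. 20], let stk = [1 .. depth] :: [Int] ]
```

The two models differ only for an index of 0, which is not a valid SWAP operand: `swapV1 0 [1, 2]` yields `Just [2, 1]` (it behaves like swap(1)), whereas `swapV2 0 [1, 2]` yields `Nothing`.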

Method

Each variant was compiled in isolation with GHC 9.8.4 at -O2 -fworker-wrapper-cbv and
-ddump-simpl -dsuppress-all -dsuppress-uniques -dno-typeable-binds -ddump-to-file.
The Symbolic specialization of exec1 was extracted from each dump (~105K lines each).

Core Output

Variant 0 — Optics

GHC compiles ix into a generic traversal function $wlvl40:

$wlvl40
  = \ @f1 $dFunctor ds eta eta1 ->
      let { lvl23 = ds [] } in
      letrec {
        $s$wgo
          = \ sc sc1 ->
              case sc1 of {
                [] -> lvl23;
                : a1 as ->
                  case sc of ds4 {
                    __DEFAULT ->
                      fmap $dFunctor (\ v -> : a1 v) ($s$wgo (-# ds4 1#) as);
                    0# -> fmap $dFunctor (\ v -> : v as) (eta a1)
                  }
              }; } in
      $s$wgo 0# eta1

The read half (^?) is compiled into a specialized version:

exec1_$s$wgo
  = \ sc sc1 ->
      case sc1 of {
        [] -> Nothing;
        : a1 as ->
          case sc of ds4 {
            __DEFAULT -> exec1_$s$wgo (-# ds4 1#) as;
            0# -> Just a1
          }
      }
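
The shape of these two Core functions can be reproduced at source level with a hand-written, functor-generic list traversal. The sketch below uses a van Laarhoven-style encoding built only from `base`; it is not how the optics library defines `ix`, and the names `ixList`, `previewAt`, and `setAtIx` are hypothetical. It shows why the generic version pays one `fmap` per cons cell, while the read-only instantiation collapses to a plain index lookup.

```haskell
import Data.Functor.Const (Const (..))
import Data.Functor.Identity (Identity (..))
import Data.Monoid (First (..))

-- Traversal focusing on index n of a list: one fmap per cons cell visited.
ixList :: Applicative f => Int -> (a -> f a) -> [a] -> f [a]
ixList _ _ []       = pure []
ixList 0 f (x : xs) = (: xs) <$> f x
ixList n f (x : xs) = (x :) <$> ixList (n - 1) f xs

-- Read: choosing f = Const (First a) discards the rebuilt list
-- and keeps only the focused element, if any.
previewAt :: Int -> [a] -> Maybe a
previewAt n = getFirst . getConst . ixList n (Const . First . Just)

-- Write: choosing f = Identity rebuilds every cons cell up to index n.
setAtIx :: Int -> a -> [a] -> [a]
setAtIx n v = runIdentity . ixList n (const (Identity v))
```

With these definitions, `previewAt 2 [10, 20, 30]` is `Just 30`, and `setAtIx 1 99 [10, 20, 30]` is `[10, 99, 30]`; an out-of-range index leaves the list unchanged on writes and gives `Nothing` on reads.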

The full handler (Symbolic specialization, lines 103058–103230):

-- 1. Two reads: stk ^? ix_i, stk ^? ix_0
ei = $wlvl40 Const (...) (...) ds32   -- traversal 1, O(idx)
e0 = exec1_$s$wgo 0# ds32             -- traversal 2, O(1) but still Maybe-wrapped

case (ei, e0) of
  (Nothing, _) -> underrun
  (_, Nothing) -> underrun
  (Just ei, Just e0) ->
    -- letrec for the write traversal (used by both .= calls)
    letrec {
      $s$wgo = \ sc sc1 ->
        case sc1 of {
          [] -> [];
          : a1 as ->
            case sc of ds4 {
              __DEFAULT -> : a1 ($s$wgo (-# ds4 1#) as);  -- keep element
              0#        -> : e0 as                          -- replace at index
            }
        }
    } in
    burn g_verylow $
      -- 2. next: bump PC → VM reconstruction #1
      -- 3. zoom (#state % #stack) (ix_i .= e0):
      --    deconstruct VM+FrameState, call $wlvl40 with Identity functor
      --    → traversal 3, fmap per cons cell → VM reconstruction #2
      -- 4. zoom (#state % #stack) (ix_0 .= ei):
      --    deconstruct VM+FrameState again
      --    → traversal 4, fmap per cons cell → VM reconstruction #3
      case eta3 of { VM x1 x2 ... x22 ->     -- deconstruct VM
        case x2 of { FrameState ... x68 ... ->  -- deconstruct FrameState
          -- apply ix_i .= e0 via $wlvl40 Identity
          ($wlvl40 Identity (...) (f) ($s$wgo k x68))
          -- then apply ix_0 .= ei via another $wlvl40 call
          -- rebuild VM+FrameState with result
        }
      }

Key characteristics:

  • 4 list traversals: 2 reads (^?) + 2 writes (.= inside zoom)
  • 3 VM+FrameState reconstructions: next + 2× zoom/.=
  • fmap per cons cell in the write traversals (closure allocation overhead)
  • Maybe wrapping for both reads
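
The four traversals are easier to see in a pure-list model of the old handler. This is only a sketch: `indexMaybe`, `setAt`, and `swapV0` are hypothetical stand-ins for `^? ix n`, `ix n .= v`, and the handler body, and the VM record rebuilds are not modelled.

```haskell
-- Safe read of index n (stands in for `stk ^? ix n`).
indexMaybe :: Int -> [a] -> Maybe a
indexMaybe _ []       = Nothing
indexMaybe 0 (x : _)  = Just x
indexMaybe n (_ : xs) = indexMaybe (n - 1) xs

-- Overwrite index n, leaving the list unchanged when n is out of range
-- (stands in for `ix n .= v`).
setAt :: Int -> a -> [a] -> [a]
setAt _ _ []       = []
setAt 0 v (_ : xs) = v : xs
setAt n v (x : xs) = x : setAt (n - 1) v xs

-- Old-style swap: two reads followed by two writes, four traversals in total.
swapV0 :: Int -> [a] -> Maybe [a]
swapV0 idx stk =
  case (indexMaybe idx stk, indexMaybe 0 stk) of
    (Just ei, Just e0) -> Just (setAt 0 ei (setAt idx e0 stk))
    _                  -> Nothing
```

For instance, `swapV0 2 [1, 2, 3, 4]` evaluates to `Just [3, 2, 1, 4]`, the same result a single `splitAt` produces with one pass over the first `idx` cells.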

Variant 1 — Current

case ds32 of {                          -- case stk of
  [] -> underrun;
  : e0 rest ->                          -- e0:rest ->
    let x1 = idx - 1 in
    join {
      $w$j29 middle ds =                -- join point (shared continuation)
        case ds of {
          [] -> underrun;
          : ei after ->
            -- single VM+FrameState reconstruction
            VM ds1
              (FrameState ds29 ds30 ds31 (+# bx 1#)  -- bumped PC (next)
                (: ei (++ middle (: e0 after)))       -- new stack
                ds33 ds34 ...)
              ds3 ds4 ...
        }
    } in
    case <=# x1 0# of {
      1# -> jump $w$j29 [] rest;        -- short-circuit: swap(1)
      __DEFAULT ->
        case splitAt' x1 rest of { (# ww, ww1 #) ->
          jump $w$j29 ww ww1            -- general case
        }
    }

Key characteristics:

  • 1 list traversal: single splitAt
  • 1 VM+FrameState reconstruction: next and assign' fused into one rebuild
  • Join point: GHC hoists the continuation into $w$j29, sharing code between the short-circuit and general paths
  • Short-circuit for swap(1): when idx-1 <= 0, jumps directly with [] rest — no splitAt call at all
  • No functor overhead: direct list construction with : and ++
  • No Maybe wrapping: pattern match handles failure directly
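
The `<=# x1 0#` test in front of `splitAt'` plausibly comes from inlining `splitAt` itself, which guards a recursive worker with an `n <= 0` check. The sketch below is modelled on my reading of the definition in `base`; treat the exact shape as an assumption, and `splitAtSketch` as a hypothetical name. For swap(1), Variant 1 passes `idx - 1 = 0`, so the guard fires and the worker is never entered.

```haskell
-- Sketch of splitAt: a guard for n <= 0 in front of a recursive worker.
splitAtSketch :: Int -> [a] -> ([a], [a])
splitAtSketch n xs
  | n <= 0    = ([], xs)   -- no traversal: this is the "short-circuit" path
  | otherwise = go n xs
  where
    go _ []       = ([], [])
    go 1 (y : ys) = ([y], ys)
    go m (y : ys) =
      let (pre, post) = go (m - 1) ys
      in (y : pre, post)
```

Under this reading the short-circuit is a property of `splitAt` rather than something specific to the handler; Variant 2 reaches the same guard with `idx` instead of `idx - 1`, so for valid indices it always takes the worker path.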

Variant 2 — Proposed

let x1 = idx in                         -- no subtraction needed
case <=# x1 0# of {
  1# -> underrun;                       -- swap(0) = impossible, correct
  __DEFAULT ->
    case splitAt' x1 ds32 of { (# ww, ww1 #) ->
      case ww of {
        [] -> underrun;                  -- prefix too short
        : e0 middle ->
          case ww1 of {
            [] -> underrun;              -- suffix empty
            : ei after ->
              -- single VM+FrameState reconstruction
              VM ds1
                (FrameState ds29 ds30 ds31 (+# bx 1#)
                  (: ei (++ middle (: e0 after)))
                  ds33 ds34 ...)
                ds3 ds4 ...
          }
      }
    }
}

Key characteristics:

  • 1 list traversal: single splitAt
  • 1 VM+FrameState reconstruction: same as variant 1
  • No subtraction: uses idx directly
  • No join point: two nested case matches prevent GHC from creating one
  • No short-circuit for swap(1): splitAt 1 stk always runs; the <=# x1 0# branch goes to underrun (swap(0) is invalid)
  • Two nested pattern matches on the split result vs one in variant 1

Summary

| Aspect | V0 (Optics) | V1 (Current) | V2 (Proposed) |
|---|---|---|---|
| List traversals | 4 (2 read + 2 write) | 1 (splitAt) | 1 (splitAt) |
| VM+FrameState rebuilds | 3 (next + 2× zoom) | 1 (assign') | 1 (assign') |
| Functor overhead | Yes (fmap per cons) | None | None |
| Maybe wrapping | Yes (2× Just/Nothing) | None | None |
| Subtraction | No | Yes (idx-1) | No |
| Join point | No | Yes | No |
| Short-circuit swap(1) | No | Yes (skips splitAt) | No (splitAt always runs) |
| Closures per call | ~4 × idx | ~0 | ~0 |

Verdict

Variant 1 (current) is the best. It performs one list traversal instead of V0's four and
builds the VM record once instead of three times. Compared to V2, it gains a join point
and a short-circuit path for the common swap(1) case. The idx-1 subtraction is a single
machine instruction, so its cost is negligible. V2's only advantage (avoiding that
subtraction) does not compensate for losing the join point and short-circuit.

@blishko (Collaborator) commented Mar 9, 2026

I have a very slight preference for my version, as it expresses the intent more concisely.
But I am fine with your version as well.

blishko mentioned this pull request Mar 9, 2026
@elopez (Collaborator, Author) commented Mar 9, 2026

> I have very slight preference for my version, as it expresses the intent more concisely. But I am fine with your version as well.

Sounds good, pushed that instead. Feel free to squash if you'd like 👍

@blishko (Collaborator) commented Mar 9, 2026

We can keep both versions in the history!

blishko merged commit 0bc0305 into main Mar 9, 2026
7 of 9 checks passed
blishko deleted the perf-swap branch March 9, 2026 14:06