
@limepoutine (Contributor) commented Jan 24, 2026

This PR serves two purposes: as an optimization and as a fix for #22448.

Optimization

Intel's optimization manual recommends using a combination of STOSB and REP STOSD. On Ivy Bridge or newer (with ERMSB), REP STOSB alone is faster, but DMD cannot assume that.

To improve address alignment, a small piece of prolog code using MOVSB/STOSB with a count less than 4 can be used to peel off the non-aligned data moves before starting to use MOVSD/STOSD.
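The peeling strategy described above can be sketched as plain C (a hypothetical illustration, not DMD's actual code generator; `memset_peeled` and its structure are my own names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the recommended sequence: a byte-granular prolog
 * peels off at most 3 leading bytes (what the backend would emit as STOSB
 * with a count < 4), so the dword loop that follows (standing in for
 * REP STOSD) runs on a 4-byte-aligned destination. */
static void memset_peeled(void *dst, unsigned char c, size_t n)
{
    unsigned char *p = dst;

    /* Prolog: single-byte stores until p is 4-byte aligned (<= 3 iterations). */
    while (n > 0 && ((uintptr_t)p & 3) != 0) {
        *p++ = c;
        n--;
    }

    /* Body: aligned 4-byte stores, standing in for REP STOSD. */
    uint32_t word = 0x01010101u * c;
    while (n >= 4) {
        memcpy(p, &word, 4);
        p += 4;
        n -= 4;
    }

    /* Epilog: remaining 0-3 tail bytes. */
    while (n-- > 0)
        *p++ = c;
}
```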

Starting from the Nehalem microarchitecture (2008), Intel CPUs internally perform 16-byte moves, so REP STOSQ is no faster than REP STOSD. Emitting STOSQ followed by STOSD only increases latency. The following excerpt is about MOVS, but the same applies to STOS.

The two components of performance characteristics of REP String varies further depending on granularity, alignment, and/or count values. Generally, MOVSB is used to handle very small chunks of data. Therefore, processor implementation of REP MOVSB is optimized to handle ECX < 4. Using REP MOVSB with ECX > 3 will achieve low data throughput due to not only byte-granular data transfer but also additional startup overhead. The latency for MOVSB is 9 cycles if ECX < 4; otherwise REP MOVSB with ECX > 9 have a 50-cycle startup cost.
For REP string of larger granularity data transfer, as ECX value increases, the startup overhead of REP String exhibit step-wise increase:
• Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
• Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string will vary if one of the 16-byte data transfers spans across a cache line boundary:
— Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles.
— Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.
• Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.
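Read as a cost model, the quoted tiers can be written down directly (a rough sketch for REP MOVSD only; the constants come straight from the excerpt above, the function name is mine, and real latency of course varies by microarchitecture):

```c
#include <assert.h>

/* Approximate cycle count for REP MOVSD with a dword count of ecx,
 * per the tiers quoted from Intel's optimization manual. Illustrative only. */
static unsigned rep_movsd_cycles(unsigned ecx, int split_free)
{
    if (ecx <= 12)                  /* short string: flat ~20 cycles */
        return 20;
    if (ecx >= 76) {                /* fast string: 16-byte hardware moves */
        unsigned bytes = ecx * 4;
        if (split_free)
            return 40 + 4 * (bytes / 64);
        return 35 + 6 * (bytes / 64);
    }
    return 15 + ecx;                /* intermediate: ~15 + 1 cycle/iteration */
}
```

For example, the benchmark's n=36 case corresponds to ECX = 9, which lands squarely in the short-string tier.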

AMD processors starting from Bulldozer (2011) also implement fast-string optimizations. Newer manuals no longer mention the REP prefix and instead recommend SIMD for block transfers, which is infeasible to implement in DMD.

Family 15h processors include a special "fast-string" microcode implementation that may be executed
when the REP prefix is applied to the MOVS or STOS instructions. The following conditions must be
met to benefit from the "fast string" microcode:
• The REP prefix is applied to the MOVS or STOS instruction (other instructions such as CMPS do
not have fast-string microcode).
• The memory destination address (and the source address for MOVS) must be data-size aligned so
any fraction of a 128-bit transfer at the end can be handled by the regular micro code.
• DF must be 0.
• The source and destination blocks must not overlap (for MOVS).
The memory addresses do not need to be 16-byte aligned, nor does the number of bytes to be moved
or stored need to be divisible by 16, but meeting either of these conditions will improve efficiency.
The microcode executes data-sized moves (or stores) until a 16-byte boundary is reached, then moves
16 bytes at a time until fewer than 16 bytes remain to be moved (or stored), then executes data-sized
moves to finish.
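The copy pattern the AMD manual describes can be sketched as follows (a hypothetical illustration of the quoted behaviour, not AMD's microcode; `movs_fast_string` is my own name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the quoted "fast string" behaviour for REP MOVS:
 * element-sized copies until a 16-byte boundary is reached, 16-byte
 * blocks in the middle, element-sized copies for the tail. */
static void movs_fast_string(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination hits a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Middle: 16 bytes at a time while at least 16 remain. */
    while (n >= 16) {
        memcpy(d, s, 16);
        d += 16; s += 16; n -= 16;
    }
    /* Tail: fewer than 16 bytes left. */
    while (n-- > 0)
        *d++ = *s++;
}
```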

Micro-benchmark

import core.time;
import core.stdc.stdio;

void fake_memset(char* ptr, char c, size_t n)
{
    ptr[0 .. n] = c;
}

void test_memset(alias arr, uint n)()
{
    auto t0 = MonoTime.currTime;
    foreach (_; 0 .. 100_000_000)
    {
        fake_memset(arr.ptr, 0xcc, n);
    }
    auto t1 = MonoTime.currTime;
    // Duration.toString returns a D string, which is not NUL-terminated,
    // so pass an explicit length to printf.
    auto dur = (t1 - t0).toString();
    printf("memset n=%u time=%.*s\n", n, cast(int) dur.length, dur.ptr);
}

void main()
{
    char[100] arr;
    test_memset!(arr, 32); // 8-byte aligned
    test_memset!(arr, 36); // 4-byte aligned
    test_memset!(arr, 39);
}

With master:

memset n=32 time=2 secs, 32 ms, 296 μs, and 2 hnsecs
memset n=36 time=1 sec, 514 ms, 984 μs, and 8 hnsecs
memset n=39 time=1 sec, 512 ms, 711 μs, and 5 hnsecs

With this PR:

memset n=32 time=898 ms, 20 μs, and 9 hnsecs
memset n=36 time=881 ms, 713 μs, and 8 hnsecs
memset n=39 time=902 ms, 584 μs, and 1 hnsec

This benchmark intentionally picks sizes in the least favorable tiers from Intel's manual, to show that emitting STOSQ does not help even in the worst case.

Fixing #22448

With the I64-specific code paths removed, it becomes trivial to flip several int variables to long, so cases where offset > int.max are now compiled correctly.

There is no test, as allocating 2 GiB would be a disaster on CI. If possible, please leave suggestions on how to test this.

obligatory cc @WalterBright

@dlang-bot (Contributor) commented

Thanks for your pull request and interest in making D better, @limepoutine! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#22449"

@limepoutine limepoutine changed the title Don't emit REP STOSQ on x64 Don't emit REP MOVSQ/STOSQ on x64 Jan 25, 2026
@limepoutine (Author) commented

Benchmark for OPstreq:

import core.time;
import core.stdc.stdio;

void test_streq(alias src, alias dst, uint n)()
{
    src[0 .. n] = 0xcc; // fill the source once, outside the timed loop

    auto t0 = MonoTime.currTime;
    foreach (_; 0 .. 100_000_000)
    {
        dst[0 .. n] = src[0 .. n];
    }
    auto t1 = MonoTime.currTime;
    // As above: pass an explicit length, since D strings are not NUL-terminated.
    auto dur = (t1 - t0).toString();
    printf("streq n=%u time=%.*s\n", n, cast(int) dur.length, dur.ptr);
}

void main()
{
    char[300] src, dst;

    test_streq!(src, dst, 64); // 8-byte aligned
    test_streq!(src, dst, 68); // 4-byte aligned
    test_streq!(src, dst, 71);

    test_streq!(src, dst, 256); // 8-byte aligned
    test_streq!(src, dst, 260); // 4-byte aligned
    test_streq!(src, dst, 263);
}

With master:

streq n=64 time=699 ms and 868 μs
streq n=68 time=769 ms, 621 μs, and 3 hnsecs
streq n=71 time=1 sec, 36 ms, 322 μs, and 2 hnsecs
streq n=256 time=692 ms and 377 μs
streq n=260 time=782 ms, 547 μs, and 4 hnsecs
streq n=263 time=1 sec, 34 ms, 642 μs, and 4 hnsecs

With this PR:

streq n=64 time=704 ms, 567 μs, and 8 hnsecs
streq n=68 time=696 ms and 660 μs
streq n=71 time=930 ms, 164 μs, and 4 hnsecs
streq n=256 time=680 ms, 134 μs, and 6 hnsecs
streq n=260 time=757 ms, 318 μs, and 2 hnsecs
streq n=263 time=950 ms, 409 μs, and 4 hnsecs

Also fixed an ICE under -O related to integer overflow.

@limepoutine (Author) commented

Turns out the optimizer can choke on types whose size doesn't fit in an int... hence the sz <= int.max ugliness.

I wonder if there are plans to make the backend fully 64-bit compatible.
