
@limepoutine (Contributor) commented Jan 24, 2026

This PR serves two purposes: as an optimization and as a fix for #22448.

Optimization

Intel's optimization manual recommends using a combination of STOSB and REP STOSD. On Ivy Bridge or newer (with ERMSB), REP STOSB alone is faster, but DMD cannot assume that.

To improve address alignment, a small piece of prolog code using MOVSB/STOSB with a count less than 4 can be used to peel off the non-aligned data moves before starting to use MOVSD/STOSD.
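The peeling strategy described above can be sketched as plain C (a hypothetical illustration, not DMD's actual code generator; `memset_peeled` and its structure are my own names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the recommended sequence: a byte-granular prolog
 * peels off at most 3 leading bytes (what the backend would emit as STOSB
 * with a count < 4), so the dword loop that follows (standing in for
 * REP STOSD) runs on a 4-byte-aligned destination. */
static void memset_peeled(void *dst, unsigned char c, size_t n)
{
    unsigned char *p = dst;

    /* Prolog: single-byte stores until p is 4-byte aligned (<= 3 iterations). */
    while (n > 0 && ((uintptr_t)p & 3) != 0) {
        *p++ = c;
        n--;
    }

    /* Body: aligned 4-byte stores, standing in for REP STOSD. */
    uint32_t word = 0x01010101u * c;
    while (n >= 4) {
        memcpy(p, &word, 4);
        p += 4;
        n -= 4;
    }

    /* Epilog: remaining 0-3 tail bytes. */
    while (n-- > 0)
        *p++ = c;
}
```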

Starting from the Nehalem microarchitecture (2008), Intel CPUs internally perform 16-byte moves, so REP STOSQ is no faster than REP STOSD. Emitting STOSQ followed by STOSD only increases latency. The following excerpt is about MOVS, but the same applies to STOS.

The two components of performance characteristics of REP String varies further depending on granularity, alignment, and/or count values. Generally, MOVSB is used to handle very small chunks of data. Therefore, processor implementation of REP MOVSB is optimized to handle ECX < 4. Using REP MOVSB with ECX > 3 will achieve low data throughput due to not only byte-granular data transfer but also additional startup overhead. The latency for MOVSB is 9 cycles if ECX < 4; otherwise REP MOVSB with ECX > 9 have a 50-cycle startup cost.
For REP string of larger granularity data transfer, as ECX value increases, the startup overhead of REP String exhibit step-wise increase:
• Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
• Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string will vary if one of the 16-byte data transfers spans across a cache line boundary:
— Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles.
— Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.
• Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.
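Read as a cost model, the quoted tiers can be written down directly (a rough sketch for REP MOVSD only; the constants come straight from the excerpt above, the function name is mine, and real latency of course varies by microarchitecture):

```c
#include <assert.h>

/* Approximate cycle count for REP MOVSD with a dword count of ecx,
 * per the tiers quoted from Intel's optimization manual. Illustrative only. */
static unsigned rep_movsd_cycles(unsigned ecx, int split_free)
{
    if (ecx <= 12)                  /* short string: flat ~20 cycles */
        return 20;
    if (ecx >= 76) {                /* fast string: 16-byte hardware moves */
        unsigned bytes = ecx * 4;
        if (split_free)
            return 40 + 4 * (bytes / 64);
        return 35 + 6 * (bytes / 64);
    }
    return 15 + ecx;                /* intermediate: ~15 + 1 cycle/iteration */
}
```

For example, the benchmark's n=36 case corresponds to ECX = 9, which lands squarely in the short-string tier.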

AMD processors starting from Bulldozer (2011) also implement fast-string optimizations. Newer manuals no longer mention the REP prefix and instead recommend SIMD for block transfers, which is infeasible to implement in DMD.

Family 15h processors include a special "fast-string" microcode implementation that may be executed
when the REP prefix is applied to the MOVS or STOS instructions. The following conditions must be
met to benefit from the "fast string" microcode:
• The REP prefix is applied to the MOVS or STOS instruction (other instructions such as CMPS do
not have fast-string microcode).
• The memory destination address (and the source address for MOVS) must be data-size aligned so
any fraction of a 128-bit transfer at the end can be handled by the regular micro code.
• DF must be 0.
• The source and destination blocks must not overlap (for MOVS).
The memory addresses do not need to be 16-byte aligned, nor does the number of bytes to be moved
or stored need to be divisible by 16, but meeting either of these conditions will improve efficiency.
The microcode executes data-sized moves (or stores) until a 16-byte boundary is reached, then moves
16 bytes at a time until fewer than 16 bytes remain to be moved (or stored), then executes data-sized
moves to finish.
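The copy pattern the AMD manual describes can be sketched as follows (a hypothetical illustration of the quoted behaviour, not AMD's microcode; `movs_fast_string` is my own name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the quoted "fast string" behaviour for REP MOVS:
 * element-sized copies until a 16-byte boundary is reached, 16-byte
 * blocks in the middle, element-sized copies for the tail. */
static void movs_fast_string(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination hits a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Middle: 16 bytes at a time while at least 16 remain. */
    while (n >= 16) {
        memcpy(d, s, 16);
        d += 16; s += 16; n -= 16;
    }
    /* Tail: fewer than 16 bytes left. */
    while (n-- > 0)
        *d++ = *s++;
}
```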

Micro-benchmark

import core.time;
import core.stdc.stdio;

void fake_memset(char* ptr, char c, size_t n)
{
    ptr[0 .. n] = c;
}

void test_memset(alias arr, uint n)()
{
    auto t0 = MonoTime.currTime;
    foreach (_; 0 .. 100_000_000)
    {
        fake_memset(arr.ptr, 0xcc, n);
    }
    auto t1 = MonoTime.currTime;
    // Duration.toString returns a D string, which is not NUL-terminated,
    // so pass an explicit length to printf.
    auto dur = (t1 - t0).toString();
    printf("memset n=%u time=%.*s\n", n, cast(int) dur.length, dur.ptr);
}

void main()
{
    char[100] arr;
    test_memset!(arr, 32); // 8-byte aligned
    test_memset!(arr, 36); // 4-byte aligned
    test_memset!(arr, 39);
}

With master:

memset n=32 time=2 secs, 32 ms, 296 μs, and 2 hnsecs
memset n=36 time=1 sec, 514 ms, 984 μs, and 8 hnsecs
memset n=39 time=1 sec, 512 ms, 711 μs, and 5 hnsecs

With this PR:

memset n=32 time=898 ms, 20 μs, and 9 hnsecs
memset n=36 time=881 ms, 713 μs, and 8 hnsecs
memset n=39 time=902 ms, 584 μs, and 1 hnsec

This benchmark intentionally picks sizes in the least favorable tiers from Intel's manual, to show that emitting STOSQ does not help even in the worst case.

Fixing #22448

With the I64-specific code paths removed, it becomes trivial to flip several int variables to long, so cases where offset > int.max are now compiled correctly.

There is no test, as allocating 2 GiB would be a disaster on CI. If possible, please leave suggestions on how to test this.

obligatory cc @WalterBright

@dlang-bot (Contributor) commented

Thanks for your pull request and interest in making D better, @limepoutine! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#22449"

@limepoutine limepoutine changed the title Don't emit REP STOSQ on x64 Don't emit REP MOVSQ/STOSQ on x64 Jan 25, 2026
@limepoutine (Author) commented

Benchmark for OPstreq:

import core.time;
import core.stdc.stdio;

void test_streq(alias src, alias dst, uint n)()
{
    src[0 .. n] = 0xcc; // fill the source once, outside the timed loop

    auto t0 = MonoTime.currTime;
    foreach (_; 0 .. 100_000_000)
    {
        dst[0 .. n] = src[0 .. n];
    }
    auto t1 = MonoTime.currTime;
    // As above: pass an explicit length, since D strings are not NUL-terminated.
    auto dur = (t1 - t0).toString();
    printf("streq n=%u time=%.*s\n", n, cast(int) dur.length, dur.ptr);
}

void main()
{
    char[300] src, dst;

    test_streq!(src, dst, 64); // 8-byte aligned
    test_streq!(src, dst, 68); // 4-byte aligned
    test_streq!(src, dst, 71);

    test_streq!(src, dst, 256); // 8-byte aligned
    test_streq!(src, dst, 260); // 4-byte aligned
    test_streq!(src, dst, 263);
}

With master:

streq n=64 time=699 ms and 868 μs
streq n=68 time=769 ms, 621 μs, and 3 hnsecs
streq n=71 time=1 sec, 36 ms, 322 μs, and 2 hnsecs
streq n=256 time=692 ms and 377 μs
streq n=260 time=782 ms, 547 μs, and 4 hnsecs
streq n=263 time=1 sec, 34 ms, 642 μs, and 4 hnsecs

With this PR:

streq n=64 time=704 ms, 567 μs, and 8 hnsecs
streq n=68 time=696 ms and 660 μs
streq n=71 time=930 ms, 164 μs, and 4 hnsecs
streq n=256 time=680 ms, 134 μs, and 6 hnsecs
streq n=260 time=757 ms, 318 μs, and 2 hnsecs
streq n=263 time=950 ms, 409 μs, and 4 hnsecs

Also fixed an ICE under -O related to integer overflow.

@limepoutine (Author) commented

Turns out the optimizer can choke on types whose size doesn't fit in an int... hence the sz <= int.max ugliness.

I wonder if there are plans to make the backend fully 64-bit compatible.
