-
-
Notifications
You must be signed in to change notification settings - Fork 668
Don't emit REP MOVSQ/STOSQ on x64 #22449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thanks for your pull request and interest in making D better, @limepoutine! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please see CONTRIBUTING.md for more information. If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment. Bugzilla referencesYour PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog. Testing this PR locallyIf you don't have a local development environment setup, you can use Digger to test this PR: dub run digger -- build "master + dmd#22449" |
|
Benchmark for import core.time;
import core.stdc.stdio;
void test_streq(alias src, alias dst, uint n)()
{
src[0 .. n] = 0xcc;
auto t0 = MonoTime.currTime;
foreach (_; 0 .. 100_000_000)
{
dst[0 .. n] = src[0 .. n];
}
auto t1 = MonoTime.currTime;
printf("streq n=%d time=%s\n", n, (t1 - t0).toString().ptr);
}
void main()
{
char[300] src, dst;
test_streq!(src, dst, 64); // 8-byte aligned
test_streq!(src, dst, 68); // 4-byte aligned
test_streq!(src, dst, 71);
test_streq!(src, dst, 256); // 8-byte aligned
test_streq!(src, dst, 260); // 4-byte aligned
test_streq!(src, dst, 263);
}With master: With this PR: Also fixed an ICE under |
|
Turns out the optimizer can choke on types whose size doesn't fit in int... Here comes I wonder if there are plans to make the backend fully 64-bit compatible. |
This PR serves two purposes: as an optimization and as a fix for #22448.
Optimization
Intel's optimization manual recommends using a combination of STOSB and REP STOSD. On Ivy Bridge or newer, REP STOSB alone is faster, but DMD cannot assume that.
Starting from Nehalem microarchitecture (2007), Intel CPUs internally performs 16-byte moves, thus rendering REP STOSQ equally fast to REP STOSD. Emitting STOSQ followed by STOSD only increases latency. The following also applies to STOS.
AMD processors starting from Bulldozer (2011) also implement fast string optimizations. Newer manuals do not mention the REP prefix, instead recommending using SIMD for block transfer, which is infeasible to implement in DMD.
Micro-benchmark
With master:
With this PR:
This benchmark intentionally chooses "intermediate string lengths" in Intel's manual, to show that emitting STOSQ does not help even in the worst case.
Fixing #22448
Removing I64 code paths allows simply flipping several
intvariables tolong, so cases whereoffset > int.maxare compiled correctly. Which is rather trivial.There is no test as allocating 2GiB will be a disaster on CI. If possible, please leave some suggestions on how to test this.
obligatory cc @WalterBright