[AMDGPU][SDAG] Only fold flat offsets if they are inbounds PTRADDs #165427
base: users/ritter-x2a/10-27-_amdgpu_nfc_mark_geps_in_flat_offset_folding_tests_as_inbounds
Conversation
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking.
@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

For flat memory instructions where the address is supplied as a base address register with an immediate offset, the memory aperture test ignores the immediate offset. Currently, SDISel does not respect that, which leads to miscompilations where valid input programs crash when the address computation relies on the immediate offset to get the base address in the proper memory aperture. Global or scratch instructions are not affected.

This patch only selects flat instructions with immediate offsets from PTRADD address computations with the inbounds flag: If the PTRADD does not leave the bounds of the allocated object, it cannot leave the bounds of the memory aperture and is therefore safe to handle with an immediate offset.

Affected tests:
- CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added new positive tests where we still do fold them.
- CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll: Offset folding doesn't seem integral to this test, so the test is not changed to make offset folding still happen.
- CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce transforms inbounds addresses for accesses to be based on potentially OOB addresses used for prefetching.
- I think the remaining ones suffer from the limited preservation of the inbounds flag in PTRADD DAGCombines due to the provenance problems pointed out in PR #165424 and the fact that `AMDGPUTargetLowering::SplitVector{Load|Store}` legalizes too-wide accesses by repeatedly splitting them in half. Legalizing a V32S32 memory access therefore leads to inbounds ptradd chains like (ptradd inbounds (ptradd inbounds (ptradd inbounds P, 64), 32), 16). The DAGCombines fold them into a single ptradd, but the involved transformations generally cannot preserve the inbounds flag (even though it would be valid in this case).

Similar previous PR that relied on `ISD::ADD inbounds` instead of `ISD::PTRADD inbounds` (closed): #132353
Analogous PR for GISel (merged): #153001

Fixes SWDEV-516125.

Patch is 854.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/165427.diff

8 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index b8b419d93021a..f16eb1649be42 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -1828,72 +1828,83 @@ bool AMDGPUDAGToDAGISel::SelectFlatOffsetImpl(SDNode *N, SDValue Addr,
isFlatScratchBaseLegal(Addr))) {
int64_t COffsetVal = cast<ConstantSDNode>(N1)->getSExtValue();
- const SIInstrInfo *TII = Subtarget->getInstrInfo();
- if (TII->isLegalFLATOffset(COffsetVal, AS, FlatVariant)) {
- Addr = N0;
- OffsetVal = COffsetVal;
- } else {
- // If the offset doesn't fit, put the low bits into the offset field and
- // add the rest.
- //
- // For a FLAT instruction the hardware decides whether to access
- // global/scratch/shared memory based on the high bits of vaddr,
- // ignoring the offset field, so we have to ensure that when we add
- // remainder to vaddr it still points into the same underlying object.
- // The easiest way to do that is to make sure that we split the offset
- // into two pieces that are both >= 0 or both <= 0.
-
- SDLoc DL(N);
- uint64_t RemainderOffset;
-
- std::tie(OffsetVal, RemainderOffset) =
- TII->splitFlatOffset(COffsetVal, AS, FlatVariant);
-
- SDValue AddOffsetLo =
- getMaterializedScalarImm32(Lo_32(RemainderOffset), DL);
- SDValue Clamp = CurDAG->getTargetConstant(0, DL, MVT::i1);
-
- if (Addr.getValueType().getSizeInBits() == 32) {
- SmallVector<SDValue, 3> Opnds;
- Opnds.push_back(N0);
- Opnds.push_back(AddOffsetLo);
- unsigned AddOp = AMDGPU::V_ADD_CO_U32_e32;
- if (Subtarget->hasAddNoCarry()) {
- AddOp = AMDGPU::V_ADD_U32_e64;
- Opnds.push_back(Clamp);
- }
- Addr = SDValue(CurDAG->getMachineNode(AddOp, DL, MVT::i32, Opnds), 0);
+ // Adding the offset to the base address in a FLAT instruction must not
+ // change the memory aperture in which the address falls. Therefore we can
+ // only fold offsets from inbounds GEPs into FLAT instructions.
+ bool IsInBounds =
+ Addr.getOpcode() == ISD::PTRADD && Addr->getFlags().hasInBounds();
+ if (COffsetVal == 0 || FlatVariant != SIInstrFlags::FLAT || IsInBounds) {
+ const SIInstrInfo *TII = Subtarget->getInstrInfo();
+ if (TII->isLegalFLATOffset(COffsetVal, AS, FlatVariant)) {
+ Addr = N0;
+ OffsetVal = COffsetVal;
} else {
- // TODO: Should this try to use a scalar add pseudo if the base address
- // is uniform and saddr is usable?
- SDValue Sub0 = CurDAG->getTargetConstant(AMDGPU::sub0, DL, MVT::i32);
- SDValue Sub1 = CurDAG->getTargetConstant(AMDGPU::sub1, DL, MVT::i32);
-
- SDNode *N0Lo = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
- DL, MVT::i32, N0, Sub0);
- SDNode *N0Hi = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
- DL, MVT::i32, N0, Sub1);
-
- SDValue AddOffsetHi =
- getMaterializedScalarImm32(Hi_32(RemainderOffset), DL);
-
- SDVTList VTs = CurDAG->getVTList(MVT::i32, MVT::i1);
-
- SDNode *Add =
- CurDAG->getMachineNode(AMDGPU::V_ADD_CO_U32_e64, DL, VTs,
- {AddOffsetLo, SDValue(N0Lo, 0), Clamp});
-
- SDNode *Addc = CurDAG->getMachineNode(
- AMDGPU::V_ADDC_U32_e64, DL, VTs,
- {AddOffsetHi, SDValue(N0Hi, 0), SDValue(Add, 1), Clamp});
-
- SDValue RegSequenceArgs[] = {
- CurDAG->getTargetConstant(AMDGPU::VReg_64RegClassID, DL, MVT::i32),
- SDValue(Add, 0), Sub0, SDValue(Addc, 0), Sub1};
-
- Addr = SDValue(CurDAG->getMachineNode(AMDGPU::REG_SEQUENCE, DL,
- MVT::i64, RegSequenceArgs),
- 0);
+ // If the offset doesn't fit, put the low bits into the offset field
+ // and add the rest.
+ //
+ // For a FLAT instruction the hardware decides whether to access
+ // global/scratch/shared memory based on the high bits of vaddr,
+ // ignoring the offset field, so we have to ensure that when we add
+ // remainder to vaddr it still points into the same underlying object.
+ // The easiest way to do that is to make sure that we split the offset
+ // into two pieces that are both >= 0 or both <= 0.
+
+ SDLoc DL(N);
+ uint64_t RemainderOffset;
+
+ std::tie(OffsetVal, RemainderOffset) =
+ TII->splitFlatOffset(COffsetVal, AS, FlatVariant);
+
+ SDValue AddOffsetLo =
+ getMaterializedScalarImm32(Lo_32(RemainderOffset), DL);
+ SDValue Clamp = CurDAG->getTargetConstant(0, DL, MVT::i1);
+
+ if (Addr.getValueType().getSizeInBits() == 32) {
+ SmallVector<SDValue, 3> Opnds;
+ Opnds.push_back(N0);
+ Opnds.push_back(AddOffsetLo);
+ unsigned AddOp = AMDGPU::V_ADD_CO_U32_e32;
+ if (Subtarget->hasAddNoCarry()) {
+ AddOp = AMDGPU::V_ADD_U32_e64;
+ Opnds.push_back(Clamp);
+ }
+ Addr =
+ SDValue(CurDAG->getMachineNode(AddOp, DL, MVT::i32, Opnds), 0);
+ } else {
+ // TODO: Should this try to use a scalar add pseudo if the base
+ // address is uniform and saddr is usable?
+ SDValue Sub0 =
+ CurDAG->getTargetConstant(AMDGPU::sub0, DL, MVT::i32);
+ SDValue Sub1 =
+ CurDAG->getTargetConstant(AMDGPU::sub1, DL, MVT::i32);
+
+ SDNode *N0Lo = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
+ DL, MVT::i32, N0, Sub0);
+ SDNode *N0Hi = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
+ DL, MVT::i32, N0, Sub1);
+
+ SDValue AddOffsetHi =
+ getMaterializedScalarImm32(Hi_32(RemainderOffset), DL);
+
+ SDVTList VTs = CurDAG->getVTList(MVT::i32, MVT::i1);
+
+ SDNode *Add =
+ CurDAG->getMachineNode(AMDGPU::V_ADD_CO_U32_e64, DL, VTs,
+ {AddOffsetLo, SDValue(N0Lo, 0), Clamp});
+
+ SDNode *Addc = CurDAG->getMachineNode(
+ AMDGPU::V_ADDC_U32_e64, DL, VTs,
+ {AddOffsetHi, SDValue(N0Hi, 0), SDValue(Add, 1), Clamp});
+
+ SDValue RegSequenceArgs[] = {
+ CurDAG->getTargetConstant(AMDGPU::VReg_64RegClassID, DL,
+ MVT::i32),
+ SDValue(Add, 0), Sub0, SDValue(Addc, 0), Sub1};
+
+ Addr = SDValue(CurDAG->getMachineNode(AMDGPU::REG_SEQUENCE, DL,
+ MVT::i64, RegSequenceArgs),
+ 0);
+ }
}
}
}
diff --git a/llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll b/llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll
index 9c49aade6099f..614500287339b 100644
--- a/llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll
+++ b/llvm/test/CodeGen/AMDGPU/fold-gep-offset.ll
@@ -24,96 +24,82 @@
; gep[inbounds](p, i + 3) -> gep(gep(p, i), 3)
-; FIXME the offset here should not be folded: if %p points to the beginning of
-; scratch or LDS and %i is -1, a folded offset crashes the program.
+; The offset here cannot be folded: if %p points to the beginning of scratch or
+; LDS and %i is -1, a folded offset crashes the program.
define i32 @flat_offset_maybe_oob(ptr %p, i32 %i) {
-; GFX90A-SDAG-LABEL: flat_offset_maybe_oob:
-; GFX90A-SDAG: ; %bb.0:
-; GFX90A-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-SDAG-NEXT: v_ashrrev_i32_e32 v3, 31, v2
-; GFX90A-SDAG-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
-; GFX90A-SDAG-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2
-; GFX90A-SDAG-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
-; GFX90A-SDAG-NEXT: flat_load_dword v0, v[0:1] offset:12
-; GFX90A-SDAG-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX90A-SDAG-NEXT: s_setpc_b64 s[30:31]
+; GFX90A-LABEL: flat_offset_maybe_oob:
+; GFX90A: ; %bb.0:
+; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX90A-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX90A-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2
+; GFX90A-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
+; GFX90A-NEXT: v_add_co_u32_e32 v0, vcc, 12, v0
+; GFX90A-NEXT: v_addc_co_u32_e32 v1, vcc, 0, v1, vcc
+; GFX90A-NEXT: flat_load_dword v0, v[0:1]
+; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: s_setpc_b64 s[30:31]
;
-; GFX10-SDAG-LABEL: flat_offset_maybe_oob:
-; GFX10-SDAG: ; %bb.0:
-; GFX10-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-SDAG-NEXT: v_ashrrev_i32_e32 v3, 31, v2
-; GFX10-SDAG-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
-; GFX10-SDAG-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX10-SDAG-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
-; GFX10-SDAG-NEXT: flat_load_dword v0, v[0:1] offset:12
-; GFX10-SDAG-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX10-SDAG-NEXT: s_setpc_b64 s[30:31]
+; GFX10-LABEL: flat_offset_maybe_oob:
+; GFX10: ; %bb.0:
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX10-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX10-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX10-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
+; GFX10-NEXT: v_add_co_u32 v0, vcc_lo, v0, 12
+; GFX10-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
+; GFX10-NEXT: flat_load_dword v0, v[0:1]
+; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX10-NEXT: s_setpc_b64 s[30:31]
;
; GFX942-SDAG-LABEL: flat_offset_maybe_oob:
; GFX942-SDAG: ; %bb.0:
; GFX942-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX942-SDAG-NEXT: v_ashrrev_i32_e32 v3, 31, v2
; GFX942-SDAG-NEXT: v_lshl_add_u64 v[0:1], v[2:3], 2, v[0:1]
-; GFX942-SDAG-NEXT: flat_load_dword v0, v[0:1] offset:12
+; GFX942-SDAG-NEXT: v_lshl_add_u64 v[0:1], v[0:1], 0, 12
+; GFX942-SDAG-NEXT: flat_load_dword v0, v[0:1]
; GFX942-SDAG-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX942-SDAG-NEXT: s_setpc_b64 s[30:31]
;
-; GFX11-SDAG-LABEL: flat_offset_maybe_oob:
-; GFX11-SDAG: ; %bb.0:
-; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-NEXT: v_ashrrev_i32_e32 v3, 31, v2
-; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-SDAG-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
-; GFX11-SDAG-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-SDAG-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
-; GFX11-SDAG-NEXT: flat_load_b32 v0, v[0:1] offset:12
-; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX11-SDAG-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX12-SDAG-LABEL: flat_offset_maybe_oob:
-; GFX12-SDAG: ; %bb.0:
-; GFX12-SDAG-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX12-SDAG-NEXT: s_wait_expcnt 0x0
-; GFX12-SDAG-NEXT: s_wait_samplecnt 0x0
-; GFX12-SDAG-NEXT: s_wait_bvhcnt 0x0
-; GFX12-SDAG-NEXT: s_wait_kmcnt 0x0
-; GFX12-SDAG-NEXT: v_ashrrev_i32_e32 v3, 31, v2
-; GFX12-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-SDAG-NEXT: v_lshlrev_b64_e32 v[2:3], 2, v[2:3]
-; GFX12-SDAG-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX12-SDAG-NEXT: s_wait_alu 0xfffd
-; GFX12-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_2)
-; GFX12-SDAG-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
-; GFX12-SDAG-NEXT: flat_load_b32 v0, v[0:1] offset:12
-; GFX12-SDAG-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX12-SDAG-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX90A-GISEL-LABEL: flat_offset_maybe_oob:
-; GFX90A-GISEL: ; %bb.0:
-; GFX90A-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-GISEL-NEXT: v_ashrrev_i32_e32 v3, 31, v2
-; GFX90A-GISEL-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
-; GFX90A-GISEL-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2
-; GFX90A-GISEL-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
-; GFX90A-GISEL-NEXT: v_add_co_u32_e32 v0, vcc, 12, v0
-; GFX90A-GISEL-NEXT: v_addc_co_u32_e32 v1, vcc, 0, v1, vcc
-; GFX90A-GISEL-NEXT: flat_load_dword v0, v[0:1]
-; GFX90A-GISEL-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX90A-GISEL-NEXT: s_setpc_b64 s[30:31]
+; GFX11-LABEL: flat_offset_maybe_oob:
+; GFX11: ; %bb.0:
+; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX11-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
+; GFX11-NEXT: v_add_co_u32 v0, vcc_lo, v0, 12
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
+; GFX11-NEXT: flat_load_b32 v0, v[0:1]
+; GFX11-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX11-NEXT: s_setpc_b64 s[30:31]
;
-; GFX10-GISEL-LABEL: flat_offset_maybe_oob:
-; GFX10-GISEL: ; %bb.0:
-; GFX10-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-GISEL-NEXT: v_ashrrev_i32_e32 v3, 31, v2
-; GFX10-GISEL-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
-; GFX10-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX10-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
-; GFX10-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, 12
-; GFX10-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
-; GFX10-GISEL-NEXT: flat_load_dword v0, v[0:1]
-; GFX10-GISEL-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX10-GISEL-NEXT: s_setpc_b64 s[30:31]
+; GFX12-LABEL: flat_offset_maybe_oob:
+; GFX12: ; %bb.0:
+; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-NEXT: s_wait_expcnt 0x0
+; GFX12-NEXT: s_wait_samplecnt 0x0
+; GFX12-NEXT: s_wait_bvhcnt 0x0
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-NEXT: v_lshlrev_b64_e32 v[2:3], 2, v[2:3]
+; GFX12-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX12-NEXT: s_wait_alu 0xfffd
+; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
+; GFX12-NEXT: v_add_co_u32 v0, vcc_lo, v0, 12
+; GFX12-NEXT: s_wait_alu 0xfffd
+; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_2)
+; GFX12-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
+; GFX12-NEXT: flat_load_b32 v0, v[0:1]
+; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-NEXT: s_setpc_b64 s[30:31]
;
; GFX942-GISEL-LABEL: flat_offset_maybe_oob:
; GFX942-GISEL: ; %bb.0:
@@ -126,44 +112,6 @@ define i32 @flat_offset_maybe_oob(ptr %p, i32 %i) {
; GFX942-GISEL-NEXT: flat_load_dword v0, v[0:1]
; GFX942-GISEL-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX942-GISEL-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX11-GISEL-LABEL: flat_offset_maybe_oob:
-; GFX11-GISEL: ; %bb.0:
-; GFX11-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-GISEL-NEXT: v_ashrrev_i32_e32 v3, 31, v2
-; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-GISEL-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
-; GFX11-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
-; GFX11-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, 12
-; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
-; GFX11-GISEL-NEXT: flat_load_b32 v0, v[0:1]
-; GFX11-GISEL-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX11-GISEL-NEXT: s_setpc_b64 s[30:31]
-;
-; GFX12-GISEL-LABEL: flat_offset_maybe_oob:
-; GFX12-GISEL: ; %bb.0:
-; GFX12-GISEL-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX12-GISEL-NEXT: s_wait_expcnt 0x0
-; GFX12-GISEL-NEXT: s_wait_samplecnt 0x0
-; GFX12-GISEL-NEXT: s_wait_bvhcnt 0x0
-; GFX12-GISEL-NEXT: s_wait_kmcnt 0x0
-; GFX12-GISEL-NEXT: v_ashrrev_i32_e32 v3, 31, v2
-; GFX12-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-GISEL-NEXT: v_lshlrev_b64_e32 v[2:3], 2, v[2:3]
-; GFX12-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
-; GFX12-GISEL-NEXT: s_wait_alu 0xfffd
-; GFX12-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX12-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
-; GFX12-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v0, 12
-; GFX12-GISEL-NEXT: s_wait_alu 0xfffd
-; GFX12-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2)
-; GFX12-GISEL-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, vcc_lo
-; GFX12-GISEL-NEXT: flat_load_b32 v0, v[0:1]
-; GFX12-GISEL-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX12-GISEL-NEXT: s_setpc_b64 s[30:31]
%idx = add nsw i32 %i, 3
%arrayidx = getelementptr inbounds i32, ptr %p, i32 %idx
%l = load i32, ptr %arrayidx
@@ -273,13 +221,742 @@ define i32 @private_offset_maybe_oob(ptr addrspace(5) %p, i32 %i) {
%l = load i32, ptr addrspace(5) %arrayidx
ret i32 %l
}
+
+; If the GEP that adds the offset is inbounds, folding the offset is legal.
+define i32 @flat_offset_inbounds(ptr %p, i32 %i) {
+; GFX90A-LABEL: flat_offset_inbounds:
+; GFX90A: ; %bb.0:
+; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX90A-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX90A-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2
+; GFX90A-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
+; GFX90A-NEXT: flat_load_dword v0, v[0:1] offset:12
+; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX90A-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: flat_offset_inbounds:
+; GFX10: ; %bb.0:
+; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX10-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX10-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX10-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
+; GFX10-NEXT: flat_load_dword v0, v[0:1] offset:12
+; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX10-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX942-LABEL: flat_offset_inbounds:
+; GFX942: ; %bb.0:
+; GFX942-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX942-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX942-NEXT: v_lshl_add_u64 v[0:1], v[2:3], 2, v[0:1]
+; GFX942-NEXT: flat_load_dword v0, v[0:1] offset:12
+; GFX942-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX942-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: flat_offset_inbounds:
+; GFX11: ; %bb.0:
+; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT: v_lshlrev_b64 v[2:3], 2, v[2:3]
+; GFX11-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT: v_add_co_ci_u32_e64 v1, null, v1, v3, vcc_lo
+; GFX11-NEXT: flat_load_b32 v0, v[0:1] offset:12
+; GFX11-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX11-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX12-LABEL: flat_offset_inbounds:
+; GFX12: ; %bb.0:
+; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
+; GFX12-NEXT: s_wait_expcnt 0x0
+; GFX12-NEXT: s_wait_samplecnt 0x0
+; GFX12-NEXT: s_wait_bvhcnt 0x0
+; GFX12-NEXT: s_wait_kmcnt 0x0
+; GFX12-NEXT: v_ashrrev_i32_e32 v3, 31, v2
+; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12-NEXT: v_lshlrev_b64_e32 v[2:3], 2, v[2:3]
+; GFX12-NEXT: v_add_co_u32 v0, vcc_lo, v0, v2
+; GFX12-NEXT: s_wait_alu 0xfffd
+; GFX12-NEXT: s_delay_al...
[truncated]
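The core change in the AMDGPUISelDAGToDAG.cpp hunk above boils down to one new condition. The following standalone C++ sketch only models that decision for illustration; it is not LLVM code, the enum and function names are made up, and the real check lives in `AMDGPUDAGToDAGISel::SelectFlatOffsetImpl` using `SIInstrFlags::FLAT` and the PTRADD node's flags:

```cpp
// Standalone model of the new folding guard; names are illustrative only and
// are not taken from LLVM.
#include <cstdint>
#include <iostream>

enum class Variant { Flat, Global, Scratch };

// Mirrors `COffsetVal == 0 || FlatVariant != SIInstrFlags::FLAT || IsInBounds`:
// a true FLAT access may only move a non-zero constant offset into the
// immediate field (or split it) if the address came from an inbounds PTRADD.
bool mayUseImmediateOffset(int64_t COffsetVal, Variant V, bool IsInBoundsPtrAdd) {
  return COffsetVal == 0 || V != Variant::Flat || IsInBoundsPtrAdd;
}

int main() {
  std::cout << mayUseImmediateOffset(12, Variant::Flat, false) << '\n';   // 0: keep the explicit add
  std::cout << mayUseImmediateOffset(12, Variant::Flat, true) << '\n';    // 1: fold as offset:12
  std::cout << mayUseImmediateOffset(12, Variant::Global, false) << '\n'; // 1: global is unaffected
  std::cout << mayUseImmediateOffset(0, Variant::Flat, false) << '\n';    // 1: nothing to fold
}
```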
Force-pushed from 88db0ea to eb5fdab
Force-pushed from 60a10b2 to ed34e66
Squash of the following upstream PRs for downstream testing:
- llvm#165424
- llvm#165425
- llvm#165426
- llvm#165427

Regenerated outputs for the following tests to resolve merge conflicts:
- llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
- llvm/test/CodeGen/AMDGPU/preserve-wwm-copy-dst-reg.ll

For SWDEV-516125.
Force-pushed from ed34e66 to 073ef09
Force-pushed from eb5fdab to 56b1e91
Force-pushed from 073ef09 to 2f2d265
For flat memory instructions where the address is supplied as a base address register with an immediate offset, the memory aperture test ignores the immediate offset. Currently, SDISel does not respect that, which leads to miscompilations where valid input programs crash when the address computation relies on the immediate offset to get the base address in the proper memory aperture. Global or scratch instructions are not affected.

This patch only selects flat instructions with immediate offsets from PTRADD address computations with the inbounds flag: If the PTRADD does not leave the bounds of the allocated object, it cannot leave the bounds of the memory aperture and is therefore safe to handle with an immediate offset.

Affected tests:
- CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added new positive tests where we still do fold them.
- CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll: Offset folding doesn't seem integral to this test, so the test is not changed to make offset folding still happen.
- CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce transforms inbounds addresses for accesses to be based on potentially OOB addresses used for prefetching.
- I think the remaining ones suffer from the limited preservation of the inbounds flag in PTRADD DAGCombines due to the provenance problems pointed out in PR #165424 and the fact that `AMDGPUTargetLowering::SplitVector{Load|Store}` legalizes too-wide accesses by repeatedly splitting them in half. Legalizing a V32S32 memory access therefore leads to inbounds ptradd chains like (ptradd inbounds (ptradd inbounds (ptradd inbounds P, 64), 32), 16). The DAGCombines fold them into a single ptradd, but the involved transformations generally cannot preserve the inbounds flag (even though it would be valid in this case).

Similar previous PR that relied on `ISD::ADD inbounds` instead of `ISD::PTRADD inbounds` (closed): #132353
Analogous PR for GISel (merged): #153001

Fixes SWDEV-516125.
Force-pushed from 56b1e91 to 4a42fa5

For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, SDISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.
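To make the hazard concrete, here is a small standalone C++ toy model. The aperture bounds and addresses below are invented purely for illustration; real apertures are configured by the platform, and, as the code comments in the diff note, the hardware routes the access using the high bits of vaddr while ignoring the immediate offset field:

```cpp
// Toy model of the aperture problem: %p at the very start of the LDS aperture
// and %i == -1, as in the flat_offset_maybe_oob test. All constants below are
// made up for illustration.
#include <cstdint>
#include <cstdio>

constexpr uint64_t SharedLo = 0x700000000000ULL;     // hypothetical LDS aperture base
constexpr uint64_t SharedHi = SharedLo + 0x10000ULL; // hypothetical LDS aperture end

const char *classify(uint64_t Addr) {
  return (Addr >= SharedLo && Addr < SharedHi) ? "shared" : "not shared";
}

int main() {
  uint64_t P = SharedLo; // %p
  int64_t I = -1;        // %i
  // Base register value if the +12 were folded into the immediate offset field:
  uint64_t Base = P + static_cast<uint64_t>(I) * 4;
  // Address the program actually wants to access:
  uint64_t Want = Base + 12;

  // The aperture test looks only at the base register and ignores the
  // immediate offset, so the folded form is routed to the wrong memory.
  std::printf("base routed as %s, but the access targets %s memory\n",
              classify(Base), classify(Want));
}
```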
This patch only selects flat instructions with immediate offsets from PTRADD
address computations with the inbounds flag: If the PTRADD does not leave the
bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.
Affected tests:
- CodeGen/AMDGPU/fold-gep-offset.ll: Offsets are no longer wrongly folded, added new positive tests where we still do fold them.
- CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll: Offset folding doesn't seem integral to this test, so the test is not changed to make offset folding still happen.
- CodeGen/AMDGPU/loop-prefetch-data.ll: loop-reduce transforms inbounds addresses for accesses to be based on potentially OOB addresses used for prefetching.
- I think the remaining ones suffer from the limited preservation of the inbounds flag in PTRADD DAGCombines due to the provenance problems pointed out in PR #165424 ([SDAG] Preserve InBounds in DAGCombines) and the fact that `AMDGPUTargetLowering::SplitVector{Load|Store}` legalizes too-wide accesses by repeatedly splitting them in half. Legalizing a V32S32 memory access therefore leads to inbounds ptradd chains like (ptradd inbounds (ptradd inbounds (ptradd inbounds P, 64), 32), 16). The DAGCombines fold them into a single ptradd, but the involved transformations generally cannot preserve the inbounds flag (even though it would be valid in this case).

Similar previous PR that relied on `ISD::ADD inbounds` instead of `ISD::PTRADD inbounds` (closed): #132353
Analogous PR for GISel (merged): #153001
Fixes SWDEV-516125.
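The fallback path retained in the AMDGPUISelDAGToDAG.cpp hunk splits offsets that do not fit the immediate field so that the immediate part and the remainder added to vaddr are both >= 0 or both <= 0, keeping the adjusted base inside the same underlying object. The sketch below is a simplified standalone model of such a sign-preserving split, not the actual `SIInstrInfo::splitFlatOffset`, whose legal range depends on subtarget and address space; a maximum immediate of 4095 is assumed purely for illustration:

```cpp
// Simplified model of splitting a constant offset into (immediate, remainder)
// with both parts sharing a sign. The 4095 limit is an assumption.
#include <cstdint>
#include <iostream>
#include <utility>

std::pair<int64_t, int64_t> splitOffsetModel(int64_t COffsetVal) {
  constexpr int64_t MaxImm = 4095;
  int64_t Imm = COffsetVal % (MaxImm + 1); // C++ '%' truncates toward zero,
                                           // so Imm keeps the sign of COffsetVal
  int64_t Remainder = COffsetVal - Imm;    // same sign as COffsetVal (or zero)
  return {Imm, Remainder};
}

int main() {
  auto [ImmPos, RemPos] = splitOffsetModel(10000);
  std::cout << RemPos << " + " << ImmPos << '\n'; // 8192 + 1808
  auto [ImmNeg, RemNeg] = splitOffsetModel(-10000);
  std::cout << RemNeg << " + " << ImmNeg << '\n'; // -8192 + -1808
}
```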