@ktf
Member

@ktf ktf commented Mar 17, 2025

No description provided.

@ktf ktf requested review from a team as code owners March 17, 2025 13:39
@ktf ktf mentioned this pull request Mar 17, 2025
${LIBJALIENO2_ROOT:+-DlibjalienO2_ROOT=$LIBJALIENO2_ROOT} \
${XROOTD_REVISION:+-DXROOTD_DIR=$XROOTD_ROOT} \
${JALIEN_ROOT_REVISION:+-DJALIEN_ROOT_ROOT=$JALIEN_ROOT_ROOT} \
${ALIBUILD_O2_FORCE_GPU:+-DENABLE_CUDA=ON -DENABLE_HIP=ON -DENABLE_OPENCL=ON} \
Contributor

This seems to revert some of my recent changes. I think you resolved a conflict incorrectly while rebasing.

Member Author

I will have a look again. When did you merge them?

@ktf ktf force-pushed the pr5793 branch 2 times, most recently from 1a30bba to 0e38c22 Compare March 17, 2025 13:55
@davidrohr
Contributor

I meant this:

commit 9a9e4396c3850fde151ff38350db1bc155dd42bc
Author: David Rohr <drohr@jwdt.org>
Date:   Fri Feb 21 21:36:25 2025 +0100

    O2 GPU Build: remove copy & paste and OpenCL1, adapt OpenCL2 options (#5771)

@ktf
Member Author

ktf commented Mar 17, 2025

OK, this was missing David's patch. However, the dependencies for ONNXRuntime now seem to be fully under control thanks to:

    -DFETCHCONTENT_FULLY_DISCONNECTED=ON \
    -DFETCHCONTENT_QUIET=OFF \
    -DCMAKE_POLICY_DEFAULT_CMP0170=NEW \
    -DFETCHCONTENT_TRY_FIND_PACKAGE_MODE=ALWAYS \

Note that an additional "out of band" flatc invocation is needed to make it work with our own version of flatbuffers.
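
For readers unfamiliar with these options, here is a minimal sketch of how they might be passed to a configure step, together with an out-of-band `flatc` call. Only the four `-D` flags come from the comment above. The helper names `fetchcontent_flags` and `regenerate_schema`, and the paths in the usage comment, are hypothetical and not taken from the recipe.

```shell
#!/bin/sh
# Print the FetchContent-related CMake flags quoted above, one per line.
fetchcontent_flags() {
  # FULLY_DISCONNECTED: skip download/update steps, sources must already exist.
  # QUIET=OFF: show the output of the population steps.
  # CMP0170=NEW: opt in to the NEW behaviour of that policy where CMake knows it.
  # TRY_FIND_PACKAGE_MODE=ALWAYS: try find_package() before fetching anything.
  printf '%s\n' \
    "-DFETCHCONTENT_FULLY_DISCONNECTED=ON" \
    "-DFETCHCONTENT_QUIET=OFF" \
    "-DCMAKE_POLICY_DEFAULT_CMP0170=NEW" \
    "-DFETCHCONTENT_TRY_FIND_PACKAGE_MODE=ALWAYS"
}

# Hypothetical helper: regenerate a C++ header from a flatbuffers schema with
# the flatc found in PATH, before configuring. Both arguments are placeholders.
regenerate_schema() {
  schema=$1
  out_dir=$2
  command -v flatc >/dev/null 2>&1 || { echo "flatc not found" >&2; return 1; }
  flatc --cpp -o "$out_dir" "$schema"
}

# Usage sketch (not executed here; all paths are placeholders):
#   regenerate_schema path/to/schema.fbs path/to/generated
#   cmake -S path/to/source -B build $(fetchcontent_flags)
```

The point of generating the header up front is that the build then uses output from the same flatbuffers version it links against, rather than whatever the upstream tree ships.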

@davidrohr
Contributor

@ktf: Should we take one of the build containers and build it locally to get this through?

@ktf
Member Author

ktf commented Mar 18, 2025

I thought all of HIP came from the system, but apparently there are some extra bits on top. Working on it.

@davidrohr
Contributor

All CUDA / ROCm should come from the system. Where do you see some extra bits on top? Clearly ONNX builds some GPU stuff on top, but that should be ONNX only.
But in any case, the SLC9 without GPU, Ubuntu, and Mac CIs are all failing due to Eigen.
I think I would fix the non-FullCI builders first, then we can look at the GPU stuff.

@ktf
Member Author

ktf commented Mar 18, 2025

As discussed privately, I am checking the Eigen issue. In principle it should have worked.

There is also:

CMake Error at /sw/slc9_x86-64/CMake/v3.28.1-13/share/cmake-3.28/Modules/CMakeFindDependencyMacro.cmake:76 (find_package):
  By not providing "Findhipblaslt.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "hipblaslt", but CMake did not find one.

  Could not find a package configuration file provided by "hipblaslt" with
  any of the following names:

    hipblasltConfig.cmake
    hipblaslt-config.cmake

which is indeed not part of HIP; it just has HIP in the name.
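
A quick way to confirm whether a given installation ships the config file CMake is looking for is to search the prefix for either of the two names listed in the error. This is only a sketch: the helper name `has_cmake_config` is hypothetical, and the prefix to search is an assumption that depends on where ROCm is installed on the builder.

```shell
#!/bin/sh
# Return 0 if a CMake package config file for package $2 exists under prefix $1.
# It looks for the two file names CMake reports: <pkg>Config.cmake and
# <pkg>-config.cmake.
has_cmake_config() {
  prefix=$1
  pkg=$2
  [ -d "$prefix" ] || return 1
  found=$(find "$prefix" \( -name "${pkg}Config.cmake" -o -name "${pkg}-config.cmake" \) -print 2>/dev/null | head -n 1)
  [ -n "$found" ]
}

# Usage sketch (the prefix is a placeholder for the local ROCm location):
#   has_cmake_config /path/to/rocm hipblaslt || echo "hipblaslt CMake config missing"
```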

@ktf
Member Author

ktf commented Mar 18, 2025

The problem with Eigen is merely that ONNXRuntime v1.21.0 needs an extra commit to support the Eigen version which actually compiles correctly.
Worked around with:

https://github.com/alisw/alidist/pull/5793/files#diff-db437a71aac29a850f979001c1e5b5193aaec6113149992588702c36025f8046R97

Unless they release v1.21.1 soon, we should probably fork and patch.

So to my understanding there are only two things remaining:

Working around the above I get to the end of it on my box.

@ktf ktf force-pushed the pr5793 branch 2 times, most recently from 4f4a1f5 to 3ed119c Compare March 18, 2025 14:44
@ktf ktf mentioned this pull request Mar 19, 2025
@ktf
Member Author

ktf commented Mar 19, 2025

I just fixed a couple of extra bits which make it compile on the EPN.

@ktf ktf force-pushed the pr5793 branch 2 times, most recently from 2bf0c3e to 4b1deb6 Compare March 19, 2025 14:18
knopers8 added a commit to knopers8/Control that referenced this pull request Mar 19, 2025
Since the new abseil in alisw/alidist#5793 exposes c++20 in headers, we have to bump accordingly.
@ChSonnabend
Collaborator

Houston, we have lift off
image

@ktf ktf force-pushed the pr5793 branch 3 times, most recently from 8786249 to 184b0a5 Compare March 22, 2025 22:38
@ktf
Member Author

ktf commented Mar 23, 2025

There are now a bunch of QC tests that fail.


## sw/BUILD/QualityControl-latest/log
 1/28 Test  #1: o2-qc-test-core ..................Subprocess aborted***Exception:   0.55 sec
 2/28 Test  #3: testPublisher ....................Subprocess aborted***Exception:   0.56 sec
 3/28 Test  #4: testQcInfoLogger .................Subprocess aborted***Exception:   0.56 sec
 4/28 Test  #5: testTaskRunner ...................Subprocess aborted***Exception:   0.55 sec
 5/28 Test  #6: testObjectsManager ...............Subprocess aborted***Exception:   0.55 sec
 6/28 Test  #9: testTriggers .....................Subprocess aborted***Exception:   0.55 sec
 7/28 Test #10: testPostProcessingInterface ......Subprocess aborted***Exception:   0.55 sec
 8/28 Test #11: testPostProcessingConfig .........Subprocess aborted***Exception:   0.56 sec
 9/28 Test #12: testReductor .....................Subprocess aborted***Exception:   0.55 sec
10/28 Test #13: testCheckWorkflow ................Subprocess aborted***Exception:   0.55 sec
11/28 Test #14: testWorkflow .....................Subprocess aborted***Exception:   0.56 sec
12/28 Test #15: testRepoPathUtils ................Subprocess aborted***Exception:   0.57 sec
13/28 Test #17: testStringUtils ..................Subprocess aborted***Exception:   0.58 sec
14/28 Test #18: testRunnerUtils ..................Subprocess aborted***Exception:   0.56 sec
15/28 Test #19: testBookkeepingQualitySink .......Subprocess aborted***Exception:   0.55 sec
16/28 Test #20: functional_test ..................***Failed    0.88 sec
17/28 Test #21: multinode_test ...................***Failed    1.04 sec
18/28 Test #22: batch_test .......................***Failed    0.78 sec
19/28 Test #23: testMeanIsAbove ..................Subprocess aborted***Exception:   0.55 sec
20/28 Test #24: testNonEmpty .....................Subprocess aborted***Exception:   0.55 sec
21/28 Test #25: testCommonReductors ..............Subprocess aborted***Exception:   0.57 sec
22/28 Test #26: testCommonHistRatios .............Subprocess aborted***Exception:   0.59 sec
23/28 Test #27: testWorstOfAllAggregator .........Subprocess aborted***Exception:   0.58 sec
24/28 Test #28: testQcDaq ........................Subprocess aborted***Exception:   0.57 sec
25/28 Test #29: testFactory ......................Subprocess aborted***Exception:   0.58 sec
26/28 Test #30: testQcExample ....................Subprocess aborted***Exception:   0.58 sec
27/28 Test #31: testQcSkeleton ...................Subprocess aborted***Exception:   0.60 sec
28/28 Test #32: testQcTOF ........................Subprocess aborted***Exception:   0.58 sec
0% tests passed, 28 tests failed out of 28

Will have a look tomorrow.
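
To tell crashes apart from ordinary test failures in a log like this one, the two kinds of lines can be counted separately. The following is a minimal sketch; the function name `summarise_ctest_log` is hypothetical, and it assumes the CTest output format shown above ("Subprocess aborted" for crashes, "***Failed" for failures).

```shell
#!/bin/sh
# Read a CTest log on stdin and print how many tests aborted and how many failed.
summarise_ctest_log() {
  awk '
    /Subprocess aborted/ { aborted++; next }   # test process crashed
    /\*\*\*Failed/       { failed++ }          # test ran and reported failure
    END { printf "aborted=%d failed=%d\n", aborted, failed }
  '
}

# Usage sketch, with the log path from the header above:
#   summarise_ctest_log < sw/BUILD/QualityControl-latest/log
```

For the listing above this gives 25 aborted and 3 failed tests, which points at a common startup problem rather than 28 independent bugs.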

@ktf
Member Author

ktf commented Mar 25, 2025

I think the issue with QC comes from compiling gRPC as a static (archive) library. Changing that back.

@ktf
Member Author

ktf commented Mar 26, 2025

So indeed the issue with QC seems to be corrected by going back to a shared library for gRPC. With a static library, some singleton gets initialised twice. I also had to add one more library for mesos due to that change. @singiamtel is checking the container. Apart from that, I am back to believing this is ready to go.

@davidrohr
Contributor

@ChSonnabend: Could you have a look? It has the hipblaslt headers available now, but now migraphx fails compilation with:

FAILED: CMakeFiles/onnxruntime_providers_migraphx.dir/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.cc.o.ddi 
/sw/slc9_x86-64/GCC-Toolchain/v14.2.0-alice2-1/bin/c++ -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DENABLE_CPU_FP16_TRAINING_OPS -DHAS_STRING_VIEW=1 -DMIOPEN_VERSION=30300 -DONLY_C_LOCALE=0 -DONNXIFI_BUILD_LIBRARY=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx -DORT_ENABLE_STREAM -DPLATFORM_POSIX -DROCM_VERSION=60302 -DUSE_MIGRAPHX=1 -DUSE_PROF_API=1 -DUSE_ROCM=1 -D_GNU_SOURCE -D__HIP_PLATFORM_AMD__=1 -Donnxruntime_providers_migraphx_EXPORTS -I/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/include/onnxruntime -I/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/include/onnxruntime/core/session -I/sw/slc9_x86-64/pytorch_cpuinfo/alice1-local1/include -I/sw/BUILD/112ce1af1ada10d2518752323b7d3f23799fee73/ONNXRuntime -I/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime -I/sw/slc9_x86-64/safe_int/v3.0.28a-local1/include -I/sw/slc9_x86-64/ms_gsl/4.0.0-17/include -I/sw/slc9_x86-64/date/v3.0.3-local1/include -I/sw/BUILD/112ce1af1ada10d2518752323b7d3f23799fee73/ONNXRuntime/amdgpu/onnxruntime -isystem /sw/slc9_x86-64/abseil/20240722.0-local1/include -isystem /sw/slc9_x86-64/onnx/v1.17.0-local1/include -isystem /sw/slc9_x86-64/protobuf/v29.3-local1/include -isystem /sw/slc9_x86-64/flatbuffers/v24.3.25-8/include -isystem /sw/slc9_x86-64/boost/v1.83.0-alice2-34/include -isystem /opt/rocm/include -isystem /opt/rocm-6.3.2/include -fPIC -O2 -std=c++20 -Wno-unknown-warning -Wno-unknown-warning-option -Wno-pass-failed -Wno-error=unused-but-set-variable -Wno-pass-failed=transform-warning -Wno-error=deprecated -Wno-error=maybe-uninitialized -Wno-error=deprecated-enum-enum-conversion -Wno-error -Wno-error=missing-requires -w -ffunction-sections -fdata-sections -DCPUINFO_SUPPORTED -Wno-unused-parameter -O3 -DNDEBUG -fPIC -Wno-deprecated-declarations -Wall -Wextra -Wno-deprecated-copy -Wno-nonnull-compare -w -Wno-error=sign-compare -Werror -E -x c++ /sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.cc -MT 
CMakeFiles/onnxruntime_providers_migraphx.dir/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.cc.o.ddi -MD -MF CMakeFiles/onnxruntime_providers_migraphx.dir/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.cc.o.ddi.d -fmodules-ts -fdeps-file=CMakeFiles/onnxruntime_providers_migraphx.dir/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.cc.o.ddi -fdeps-target=CMakeFiles/onnxruntime_providers_migraphx.dir/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.cc.o -fdeps-format=p1689r5 -o CMakeFiles/onnxruntime_providers_migraphx.dir/sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.cc.o.ddi.i
In file included from /sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/migraphx_inc.h:8,
                 from /sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.h:6,
                 from /sw/SOURCES/ONNXRuntime/v1.21.0/v1.21.0/onnxruntime/core/providers/migraphx/gpu_data_transfer.cc:5:
/opt/rocm/include/migraphx/migraphx.hpp:1234:5: error: module control-line cannot be in included file
 1234 |     module get_main_module()
      |     ^~~~~~
/opt/rocm/include/migraphx/migraphx.hpp:1248:5: error: module control-line cannot be in included file
 1248 |     module create_module(const std::string& name)
      |     ^~~~~~

We could also disable migraphx at first, and solve that later.
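
Two possible workarounds can be sketched as extra flags for the ONNXRuntime configure step. Both are assumptions and neither has been verified against this build. The first switches the MIGraphX provider off through the `onnxruntime_USE_MIGRAPHX` CMake option. The second guesses at the cause: the `-fmodules-ts` and `-fdeps-format=p1689r5` flags in the log suggest that CMake's C++20 module scanning is active, in which case `CMAKE_CXX_SCAN_FOR_MODULES=OFF` might stop the member function return type `module` from being parsed as a module control-line. The helper name `migraphx_workaround_flags` is hypothetical.

```shell
#!/bin/sh
# Print the extra CMake flag for one of two unverified workarounds.
migraphx_workaround_flags() {
  case "$1" in
    # Assumption: build ONNXRuntime without the MIGraphX execution provider.
    disable-migraphx) echo "-Donnxruntime_USE_MIGRAPHX=OFF" ;;
    # Assumption: the error is triggered by C++20 module dependency scanning.
    no-module-scan)   echo "-DCMAKE_CXX_SCAN_FOR_MODULES=OFF" ;;
    *) echo "usage: migraphx_workaround_flags disable-migraphx|no-module-scan" >&2
       return 1 ;;
  esac
}

# Usage sketch (paths are placeholders):
#   cmake -S path/to/source -B build $(migraphx_workaround_flags disable-migraphx)
```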

@ChSonnabend
Collaborator

I would disable migraphx for now, then, and merge as it is. I haven't used the migraphx API calls myself yet; I can make a separate PR once I know what the issue is.

lkrcal
lkrcal previously approved these changes Mar 27, 2025
@ktf ktf merged commit 8084883 into alisw:master Mar 27, 2025
9 of 12 checks passed
@ktf
Member Author

ktf commented Mar 27, 2025

Merging since this was already tested. Will start a build now so that we gain a few hours for tomorrow.

# Our environment
set ${PKGNAME}_ROOT \$::env(BASEDIR)/$PKGNAME/\$version
prepend-path ROOT_INCLUDE_PATH \$${PKGNAME}_ROOT/include/onnxruntime
Collaborator

Why was this taken out? I believe this change has been causing crashes of all MC jobs for the past 4 days. See here: #5793
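
One way to check whether an environment is affected is to test if `ROOT_INCLUDE_PATH` still contains the `include/onnxruntime` entry that the removed `prepend-path` line used to add. This is a minimal sketch: the helper name `path_contains` is hypothetical, and `ONNXRUNTIME_ROOT` in the usage comment is an assumed variable name for the package root, not something confirmed by the recipe.

```shell
#!/bin/sh
# Return 0 if the colon-separated list $1 contains the exact entry $2.
path_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Usage sketch (ONNXRUNTIME_ROOT is an assumed variable name):
#   path_contains "$ROOT_INCLUDE_PATH" "$ONNXRUNTIME_ROOT/include/onnxruntime" \
#     || echo "onnxruntime headers are not on ROOT_INCLUDE_PATH"
```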

6 participants