Conversation
@Emrys-Merlin great contribution, Tim :-)
Getting some test failures with this env on AMD. It could all be expected numerics, unclear. This is the chip update. Looking in more detail, the test_triangular_multiplicative_update.py update seems fine/minimal drift. The other two (both for test_triangular_attention.py) look more severe:
Thanks a lot for testing this @jandom! I really appreciate it :-) I think I count it as a win that the tests ran at all :-D I agree that some of the numerical differences warrant deeper inspection. I'm open to support here, but I am a bit handicapped without access to AMD GPUs. If it is easy for you to share limited access with me for debugging, that could speed things up a bit. I will continue looking for an internal solution. I will be on vacation next week, so I won't be very responsive. If we don't find a solution before Barcelona, I'm happy to chat there :-)
No worries, I've shared this ticket with Gagan already – he might be able to come in and help |
Summary
This PR introduces a ROCm pixi environment called openfold3-rocm7, in line with the cpu/cuda12/cuda13 environments. This unifies the usage pattern of openfold3 after the migration to the pixi package manager.
Changes
Related Issues
I tried to build the environment on our HPC cluster, but our proxy interfered with the resolution of the pytorch dependency. @sdvillal thankfully already opened an issue about that with the pixi developers, so hopefully this will be resolved soon. I spun up an AWS EC2 instance where the resolution worked without any issues.
Testing
I could only test that the environment resolves, as I do not have access to an AMD accelerator. @singagan if you could help me out here, that would be highly appreciated :-)
The current output of the validate-openfold3-rocm command is as follows:

$ pixi run -e openfold3-rocm7 validate-openfold3-rocm
OpenFold3 ROCm environment check
[PASS] PyTorch installed: 2.11.0+rocm7.2
[PASS] PyTorch built with ROCm (HIP): 7.2.26015
[FAIL] ROCm GPU visible: none
[PASS] Triton installed: 3.6.0
[FAIL] Triton backend is HIP: 0 active drivers ([]). There should only be one.
[PASS] Triton evoformer kernel loaded
One or more checks failed. See above for details.
Installation instructions: https://github.com/aqlaboratory/openfold-3/blob/main/docs/source/Installation.md

Other Notes
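As an aside, the validator report above follows a simple [PASS]/[FAIL] line convention, so its results can be tallied mechanically. A minimal sketch (the summarize helper is hypothetical, not part of openfold3 or this PR):

```python
# Minimal sketch, not part of openfold3: tally [PASS]/[FAIL] lines
# from a validate-openfold3-rocm style report.

def summarize(report: str) -> tuple[int, int]:
    """Return (passed, failed) counts for a check report."""
    lines = report.splitlines()
    passed = sum(ln.startswith("[PASS]") for ln in lines)
    failed = sum(ln.startswith("[FAIL]") for ln in lines)
    return passed, failed

report = """\
[PASS] PyTorch installed: 2.11.0+rocm7.2
[PASS] PyTorch built with ROCm (HIP): 7.2.26015
[FAIL] ROCm GPU visible: none
[PASS] Triton installed: 3.6.0
[FAIL] Triton backend is HIP: 0 active drivers ([]). There should only be one.
[PASS] Triton evoformer kernel loaded"""

print(summarize(report))  # (4, 2)
```

On a correctly set-up AMD machine, the two [FAIL] lines above would be expected to flip to [PASS].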
Note that, as we need to pull pytorch from PyPI, we pull almost all dependencies from PyPI and not from conda-forge. This is necessary because if any one of our dependencies pulled pytorch from conda-forge, that would supersede our PyPI pytorch request and we would end up with a pytorch build without ROCm support. This is a known pixi limitation. If it gets resolved, we could consider pulling more of the dependencies from conda-forge, but this is optional and not a blocker.
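For illustration, the PyPI pinning described above might look roughly like this in pixi.toml. This is a sketch only: the feature name, index URL, and version spec here are assumptions, not this PR's actual configuration.

```toml
# Sketch only: pull torch from PyTorch's ROCm wheel index via PyPI,
# so no conda-forge dependency can supersede it with a non-ROCm build.
# Feature name, index URL, and version spec are illustrative assumptions.
[feature.openfold3-rocm7.pypi-options]
extra-index-urls = ["https://download.pytorch.org/whl/rocm7.0"]

[feature.openfold3-rocm7.pypi-dependencies]
torch = "*"
```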
@sdvillal, I would love to get your feedback. The environment setup is rather complex and I'm not completely convinced I assembled the ROCm environment correctly (or whether I pulled in unnecessary features).
@jnwei @jandom As discussed in #166, this is the draft to enable ROCm in the pixi setup.