CE: Added pages with guidelines for images on Alps #272
preview available: https://docs.tds.cscs.ch/272
I'll try to come back to this for a proper review in the next ~week. If someone else wants to look in the meantime, please go ahead. A question that came up while skimming through the changes: how do the performance results in this PR relate to the changes in #262, and should we duplicate them? Another minor note: it would be good to add links to the "Communication libraries" sections, since they contain useful information about environment variables etc. (sorry if I missed this and you already added them). Finally, for the spell checker please add the remaining words to
Thanks for the tip about the spelling checker @msimberg. Regarding references to the "Communication libraries" section, I didn't add anything at this time.
I think it's useful to document what we set anyway for at least two reasons:
The downside is that whatever we document here may be out of sync with container engine hooks.
Very useful! I left a few comments.
docs/software/container-engine/guidelines-images/image-comm-fwk.md
- Ubuntu 24.04
- CUDA 12.8.1
- GDRCopy 2.5.1
- Libfabric 1.22.0
- UCX 1.19.0
This is bound to go out of date, at least temporarily. Should we give users instructions on how to retrieve this information from the container or registry instead?
It is also a bit redundant with versions explicitly set below.
docs/software/container-engine/guidelines-images/image-comm-fwk.md
- Ubuntu 24.04
- CUDA 12.8.1
- GDRCopy 2.5.1
- Libfabric 1.22.0
- UCX 1.19.0
- MPICH 4.3.1
See previous comment.
## Contents

- Ubuntu 24.04
- CUDA 12.8.1 (includes NCCL)
- GDRCopy 2.5.1
- Libfabric 1.22.0
- UCX 1.19.0
- OpenMPI 5.0.8
- NVSHMEM 3.4.5
See previous comment.
docs/software/container-engine/guidelines-images/image-nvshmem.md
## Contents

- Ubuntu 24.04
- CUDA 12.8.1
- GDRCopy 2.5.1
- Libfabric 1.22.0
- UCX 1.19.0
- OpenMPI 5.0.8
See previous comment.
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
These changes present some excellent information for our users (and for CSCS staff).
As structured, the docs are not integrated with the rest of the documentation. Namely:
- the library-specific content can move to the appropriate section of "Applications and Frameworks"
- links can be provided to existing docs (e.g. where the CXI hook is mentioned, make a link to the hook docs)
Documentation on how to compile software, programming environments, etc. is provided in the Applications & Frameworks
part of the documentation: https://docs.cscs.ch/software/
As @msimberg pointed out, the logical location for this material is the Applications & Frameworks -> Communication libraries
section.
Otherwise users are less likely to find these pages, and will be confused by having the same thing documented in two locations.
For example, there is an existing MPICH page that explicitly covers how to create a container (the use case was users of the containerised CI/CD tool):
https://docs.cscs.ch/software/communication/mpich/
I think it is easiest to have a call, to speed up the process of determining what to do.
As agreed in VCUE, I have updated a set of images with foundational resources (CUDA, MPI, NCCL, NVSHMEM), deriving them from material I use myself, and demonstrated how to run them through the CE on Alps.
The intent of this material is to offer guidelines and suggestions about versions, building, and running of foundational components to use on Alps, without committing to officially supported resources.