Madeeks (Member) commented Sep 30, 2025

As agreed in VCUE, I have updated a set of images with foundational resources (CUDA, MPI, NCCL, NVSHMEM), deriving them from material I use myself, and demonstrated how to run them through the CE on Alps.

The intent of this material is to offer guidelines and suggestions on choosing versions of, building, and running foundational components on Alps, without committing to officially supported resources.

preview available: https://docs.tds.cscs.ch/272

msimberg (Contributor) commented Oct 1, 2025

I'll try to come back to this for a proper review in the next ~week. If someone else wants to look in the meantime, please go ahead.

A question that came up while skimming through the changes is how the performance results in this PR relate to the changes in #262 and whether we should duplicate them?

Another minor note would be to add links to the "Communication libraries" sections, since they contain some useful info about environment variables etc. (sorry if I missed this and you already added them).

Finally, for the spell checker please add the remaining words to .github/actions/spelling/allow.txt (with all caps = exact match, lower case = any capitalization; I think the former is appropriate in this case).

github-actions bot commented Oct 1, 2025

preview available: https://docs.tds.cscs.ch/272

Madeeks (Member, Author) commented Oct 1, 2025

Thanks for the tip about the spelling checker, @msimberg.

Regarding references to the "Communication libraries" section, I didn't add anything at this time.
I'm absolutely open to doing it, but what I would find most useful is to agree on which settings we should embed directly in hooks (thereby making them transparent to users, like a bunch of NCCL-related vars in the AWS OFI NCCL hook) and which we leave for users to handle directly (e.g. in the EDF).

msimberg (Contributor) commented Oct 3, 2025

> what I would find most useful is to agree on which settings we should embed directly in hooks (thereby making them transparent to users, like a bunch of NCCL-related vars in the AWS OFI NCCL hook) and which we leave for users to handle directly (e.g. in the EDF).

I think it's useful to document what we set anyway for at least two reasons:

  • it serves as a useful reference to reproduce results manually and/or on other systems and makes it less "magic" (you actually see what's being set, without having to inspect the environment)
  • uenv users don't have hooks to set the environment variables (there's been discussion about this, but I don't think we have a good solution yet; should they be baked into the uenv, what happens when recommendations change etc.?)

The downside is that whatever we document here may be out of sync with container engine hooks.
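
To illustrate the "less magic" point, here is a minimal Python sketch for listing the communication-related variables that are actually set in a job's environment, so they can be compared with whatever ends up documented. The prefixes (`NCCL_`, `FI_`, `MPICH_`, `NVSHMEM_`) are assumptions chosen for illustration, not an authoritative list of what the hooks set, and `select_env` is a hypothetical helper.

```python
import os

# Assumed prefixes, for illustration only; adjust to the libraries in use.
PREFIXES = ("NCCL_", "FI_", "MPICH_", "NVSHMEM_")


def select_env(env, prefixes=PREFIXES):
    """Return the entries of `env` whose names start with one of `prefixes`,
    ordered by variable name."""
    wanted = tuple(prefixes)
    return {name: env[name] for name in sorted(env) if name.startswith(wanted)}


if __name__ == "__main__":
    # Print the matching variables of the current process, one per line.
    for name, value in select_env(os.environ).items():
        print(f"{name}={value}")
```

Running it with and without a given hook enabled and diffing the two outputs would show which settings were injected, without inspecting the whole environment by hand.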

RMeli (Member) left a comment

Very useful! I left a few comments.

Comment on lines +22 to +26
- Ubuntu 24.04
- CUDA 12.8.1
- GDRCopy 2.5.1
- Libfabric 1.22.0
- UCX 1.19.0

This is bound to go out of date, at least temporarily. Should we provide some instructions to the user on how to retrieve this information from the container or registry instead?
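
As one possible starting point, assuming the image is a standard Linux distribution that ships `/etc/os-release`, the base OS release can be read from inside the container with a few lines of Python. This sketch covers only the OS; the other components (CUDA, Libfabric, UCX, ...) would each need their own query, and `parse_os_release` is a hypothetical helper, not part of any existing tooling.

```python
from pathlib import Path


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict,
    skipping blank lines and comments and dropping surrounding quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


if __name__ == "__main__":
    path = Path("/etc/os-release")
    if path.exists():
        info = parse_os_release(path.read_text())
        print(info.get("NAME"), info.get("VERSION_ID"))
```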

It is also a bit redundant with versions explicitly set below.

Comment on lines +13 to +18
- Ubuntu 24.04
- CUDA 12.8.1
- GDRCopy 2.5.1
- Libfabric 1.22.0
- UCX 1.19.0
- MPICH 4.3.1

See previous comment.

Comment on lines +11 to +19
## Contents

- Ubuntu 24.04
- CUDA 12.8.1 (includes NCCL)
- GDRCopy 2.5.1
- Libfabric 1.22.0
- UCX 1.19.0
- OpenMPI 5.0.8
- NVSHMEM 3.4.5

See previous comment.

Comment on lines +11 to +18
## Contents

- Ubuntu 24.04
- CUDA 12.8.1
- GDRCopy 2.5.1
- Libfabric 1.22.0
- UCX 1.19.0
- OpenMPI 5.0.8

See previous comment.

Co-authored-by: Rocco Meli <r.meli@bluemail.ch>

github-actions bot commented Oct 6, 2025

preview available: https://docs.tds.cscs.ch/272

bcumming (Member) left a comment

These changes present some excellent information for our users (and for CSCS staff).

The docs as structured are not integrated with the rest of the documentation. Namely:

  • the library-specific content can move to the appropriate section of "Applications and Frameworks"
  • links can be provided to existing docs (e.g. where the CXI hook is mentioned, link to the hook docs)

Documentation on how to compile software, on programming environments, etc. is provided in the Applications & Frameworks part of the documentation: https://docs.cscs.ch/software/

As @msimberg pointed out, the logical location for this material is the Applications & Frameworks -> Communication libraries section.
Otherwise users are less likely to find these pages, and will be confused by having the same thing documented in two locations.
For example, there is an existing MPICH page that explicitly covers how to create a container (the use case was users of the containerised CI/CD tool).
https://docs.cscs.ch/software/communication/mpich/

I think it is easiest to have a call, to speed up the process of determining what to do.
