WeeklyTelcon_20210309
- Dialup Info: (Do not post to public mailing list or public wiki)
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Christoph Niethammer (HLRS)
- Edgar Gabriel (UH)
- Geoffrey Paulsen (IBM)
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Naughton III, Thomas (ORNL)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Brian Barrett (AWS)
- Charles Shereda (LLNL)
- David Bernholdt (ORNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- Hessam Mirsadeghi (UCX/nVidia)
- Howard Pritchard
- Joshua Ladd (nVidia/Mellanox)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Noah Evans (Sandia)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Tomislav Janjusic
- Xin Zhao (nVidia/Mellanox)
- UCX priority PR https://github.com/open-mpi/ompi/pull/8496 merged to master and v4.0.x and v4.1.x
- Do we have a wiki page on how to rerun bots? How to make this more prominent?
- PR commands front and center on wiki: https://github.com/open-mpi/ompi/wiki/PRJenkins
- Jeff is working on getting a bot checker
- PR 8511 Addresses issues/8321
- Merged to master and release branches
- IBM CI needs to upgrade UCX from 1.8 to 1.9
- Believe we did. Something is not building: one compiler is failing.
- Austen is investigating
- Some discussion of WHEN should we call abort in various components?
- Sometimes better to call an MPI error handler
- Sometimes it's unclear what to do
- Ran out of time today for discussion
- We'd like to do that, but not right when we are about to branch like this.
- Needs to be coordinated.
- Will discuss more in the future (out of time)
- Would love to commit this, and then run it across the whole code base.
- Overall changes seem reasonable
- Would have a GitHub CI check that fails if the user's PR is not formatted.
- git has hooks for clang-format.
- Could output the diff so a user without clang could fix their PR.
- run by hand for commit, or based on changed files
- Could also run on github commit hook.
- When running this on submodules, need to figure out how to deal with 3rd-party code.
- Nice to get this in before v5.0
- Won't do on v4.0.x or v4.1.x
- Probably sort the ordering of includes, but this might point out some breaks.
- Order is based off reg-ex
- The intent would have a 2nd PR that fixes
- Excluding ROMIO
- treematch
- 3rdParty
- Nathan can tell us how to do a pre-commit hook in our local git config.
- Everyone seems in favor. Will discuss more on the ticket, and merge in a few days, or next Tuesday at the latest.
- No reason to hold v5.0 branch for this, but want to do this for v5.0.
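
The local check discussed above (format only changed files, skip imported trees, print a diff so someone without clang-format can still fix their PR) could look roughly like the sketch below. It is not the hook from the PR. The excluded path fragments and the function names are assumptions made for illustration, and it reads the working-tree copy of each file rather than the staged blob.

```python
"""Sketch of a pre-commit check: clang-format staged C files and print a diff."""
import difflib
import subprocess
import sys

# Assumed path fragments for the excluded trees (3rd-party, ROMIO, treematch);
# the real layout and exclusion list may differ.
EXCLUDED_FRAGMENTS = ("/3rd-party/", "/romio", "/treematch/")
C_SUFFIXES = (".c", ".h")


def should_check(path):
    """True for C sources/headers that are outside the excluded trees."""
    if not path.endswith(C_SUFFIXES):
        return False
    probe = "/" + path
    return not any(fragment in probe for fragment in EXCLUDED_FRAGMENTS)


def format_diff(path, original, formatted):
    """Unified diff from the current text to the formatted text ('' if equal)."""
    return "".join(
        difflib.unified_diff(
            original.splitlines(keepends=True),
            formatted.splitlines(keepends=True),
            fromfile=path,
            tofile=path + " (clang-format)",
        )
    )


def staged_files():
    """Paths that are added, copied, or modified in the git index."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def main():
    """Return 1 and print diffs if any staged C file is not formatted."""
    failed = False
    for path in filter(should_check, staged_files()):
        with open(path, encoding="utf-8") as handle:
            original = handle.read()
        # Without -i, clang-format writes the formatted file to stdout.
        formatted = subprocess.run(
            ["clang-format", "--style=file", path],
            check=True, capture_output=True, text=True,
        ).stdout
        diff = format_diff(path, original, formatted)
        if diff:
            sys.stdout.write(diff)
            failed = True
    return 1 if failed else 0
```

To use it as a local hook, the script would end with `sys.exit(main())` and be installed as `.git/hooks/pre-commit`; a CI job could call the same `main()` and publish the printed diff.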
- Building with autoconf 2.70 produced a bunch of warnings / was broken
- MacPorts just updated its default to autoconf 2.71, which is now out
- Some work to clean up < 2.6 macros has gone into master
- Christoph has updated master so that only a few warnings remain.
- Ralph is following
- Works with autoconf 2.70; one mca component might have issues.
- 2.69 - 2012
- 2.70 - 2020 (new dev, deprecated many things)
- 2.71 - 2021 (8 months - prob quick turn bugfix)
-
32bit? Do we want to continue to support this?
-
https://github.com/open-mpi/ompi/issues/8566
- Broke last week.
- Using an actual 32bit gcc - Compile fail
-
Originally decorated PR with #if 32bit, but passed CI, so decided
-
Raspberry Pi 32bit - George and Jeff are running on this.
- Just hobbyist toy systems.
-
Nathan thinks he might be able to write a compare-and-swap
-
32bit is not really a production environment, just hobbyist.
-
Production seems to be 64bit.
-
v5.0 - good time to drop 32bit.
- Jeff will send note to packaging, and see if they will care.
- Debian might care.
- Jeff volunteered to
- OSC/RDMA assumed everything was 64bit, but once we changed
- Jeff will ask Absoft
-
On 32bit, if we could use C11 atomics with locks, it might be allowed.
- So perhaps this would be a path.
- Is C11 available on older 32bit systems?
- With gcc 6.0+ it should work fine.
-
PMIX v4.1 might be delayed.
- So backup plan is get PRRTE working with PMIx v4.0
- Not sure what we'll lose with PMIx v4.0 instead of v4.1
- Folks should try running OMPI with PMIx v4.0. We will probably release Open-MPI v5.0 with PMIx v4.0.
-
Too many Open Issues (50)
- Geoff and Howard will go over v4.0.x issues, and try to close or address some of them.
- Need to label some as wont_fix, let sit for a while, and then close
-
PR 8435 - https://github.com/open-mpi/ompi/pull/8435
- By mistake, this was targeting v4.1 instead of master.
- Draft PR.
-
UCX Issue 8321,
- We do need to understand what's going on: there were comments saying we should not support anything older than 1.9.0, but then there was a comment that it's reproducible in 1.9 also.
- Is this a UCX problem, or a PML problem?
- We don't know if it's PML or UCX
-
UCX 1.9.0 + OMPI 4.0.4 - Issue 8442
- datatype engine issue
- George has a fix, but it no longer applies cleanly.
- He will try to push, so someone else can
- PR8473 - Sergy pushed a possible fix, but it still failed a CI test, and then closed the PR.
- May not be related to Issue 8321
- We're ready to cut an RC for both 4.1.1 and 4.0.6, these two are blocking.
-
UCX meeting is on Wednesdays
- Howard may go tomorrow.
- UCX community didn't like us configuring out, they're looking into
- It'd be nice to link this to an issue tomorrow.
- Merged all fixes for v4.0.6 except for something to address blocker: https://github.com/open-mpi/ompi/issues/8442
- Still on George's plate.
- Two issues (the main issue, but there might also be an incorrect assert in the datatype engine)
- George: no, the assert is correct; it shouldn't be called in that case.
- blocking on UCX issues (see New topics above) *
-
Before we branch
-
PR 8536 needs another set of eyes from Howard and George. *
-
Ralph almost has singleton comm spawn working
-
Geoff went through most open PRs and many of the newer issues to see if anything would block the branching of v5.0. Discussed these briefly: WeeklyTelcon_20210302-ompiv5-branching
- Looks on target to branch next week, after the AWS GPU Direct PR and the remove-CR change get in
-
PRRTE making good progress:
- Ralph resolved about 11 tickets in PRRTE last week. Maybe 20 more
- Then prrte will branch v2.0
- Open-MPI can branch anytime, we'll revisit end of Feb.
-
Raghu, How is GPU Direct RDMA for AWS? Still on track. PR this week.
-
One-sided tests are still busted. Do we keep running these if they're failing?
- Nathan is actively working on it, so hopeful we'll get this.
-
Josh summarized discussion from last week in issue.
-
Anything else Josh needs to implement?
- No, Josh will get to before end of month, before v5.0 branches.
-
master configure issue - for v5.0 both of these will need to be fixed.
- Lustre configure option: Edgar sees it, but has no idea how to fix it.
- Not sure if he should open an issue. Ralph thinks Gilles fixed it. Edgar will give it a try.
- SharedFP component: Edgar opened an issue this morning.
- Blocker for v5.0
-
ECP Community days ( March 30-April 1st )
- David Bernholdt and/or George Bosilca
- Each day 90 minute time slots.
- Get proposal in by this Friday.
- Tuesday March 30th from 1-2:30pm (US Eastern)
- Invited some people to speak. They will be our main community speakers.
- Anyone on OMPI community can send slides to Jeff and George
- Due Friday March 26th
- PMIx Wed 31st 11 - 12:30 (US Eastern)
-
Discuss for v5.0
- Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
- One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
- Non-homogeneous clusters (GPUs on some nodes, and non-GPUs on some other)
- PR 8329 - convert README, HACKING, and possibly Manpages to reStructuredText.
- Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
- Intent this is for v5.0
- mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
- Ralph has asked about this for PMIx/PRRTE since this is turning out to work
- What do we want to do about ROMIO in general.
- OMPIO is the default everywhere.
- Gilles is saying the changes we made are integration changes.
- There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
- We may be able to work with upstream to make a clear API between the two.
- As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
- Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
- Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
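
A bot along those lines could start from something as small as the sketch below, which takes the list of files a PR changes and reports the ones that fall under imported upstream code. The glob patterns and the function name are hypothetical; the actual set of watched files would have to be agreed on.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for imported upstream code; not an agreed-upon list.
PROTECTED_PATTERNS = (
    "3rd-party/*",
    "ompi/mca/io/romio*/romio/*",
    "ompi/mca/topo/treematch/treematch/*",
)


def flag_protected(changed_files, patterns=PROTECTED_PATTERNS):
    """Return {path: pattern} for each changed file matching a protected pattern."""
    flagged = {}
    for path in changed_files:
        for pattern in patterns:
            # fnmatchcase's "*" also matches "/" so one pattern covers a whole tree.
            if fnmatchcase(path, pattern):
                flagged[path] = pattern
                break
    return flagged
```

A CI job would feed it the PR's changed-file list and post a comment or add a label when the returned dictionary is non-empty, leaving the decision to a human reviewer.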
How's the state of https://github.com/open-mpi/ompi-tests-public/
- Putting new tests there
- Very little there so far, but working on adding some more.
- Should have some new Sessions tests
- what is being reported looks pretty good.
- ppc atomics - Austen has been looking at this
- Intercomm Merge is getting inconsistent ordering of procs.
- What is the priority of this?
- Many of the ibm tests start off by doing some intercomm manipulation.
- Won't get
- Mellanox MTT had been failing. Boris set some debug, and they unplugged it.
-
They plan to re-enable it tomorrow.
-