Conversation

@mcgibbon (Contributor)

This PR adds some non-trivial tests of data and model parallelism.

The implementations are trivial for the non-model-parallel case, but these tests will provide good coverage of the upcoming model-parallel code. This PR puts the infrastructure in place to run these tests and extend them.

Changes:

  • Added get_local_slices, data_parallel_rank, and total_data_parallel_ranks attributes to the Distributed and DistributedBackend classes. These are currently used only in unit tests, but will be needed for spatial parallelism; a sketch of the trivial non-model-parallel case follows this list.

  • Added tests of data and model parallelism under ./fme/core/distributed/parallel_tests.
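For reference, this is a minimal sketch of what the trivial non-model-parallel implementations might look like; the constructor and the exact get_local_slices signature are assumptions for illustration, not taken from this PR.

import torch.distributed as dist


class Distributed:
    """Hypothetical sketch: trivial behavior when there is no model parallelism."""

    def __init__(self):
        # Assumption: rank and world size come straight from torch.distributed,
        # falling back to single-process defaults when it is not initialized.
        self._rank = dist.get_rank() if dist.is_initialized() else 0
        self._world_size = dist.get_world_size() if dist.is_initialized() else 1

    @property
    def data_parallel_rank(self) -> int:
        # Without model parallelism, every rank is a data-parallel rank.
        return self._rank

    @property
    def total_data_parallel_ranks(self) -> int:
        # Likewise, all ranks participate in data parallelism.
        return self._world_size

    def get_local_slices(self, global_shape):
        # Without spatial parallelism each rank holds the full domain, so the
        # local slice of every dimension spans the whole dimension.
        return tuple(slice(0, size) for size in global_shape)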

Comment on lines +195 to +197
gathered_global = torch.zeros(
    *global_shape, dtype=tensor.dtype, device=tensor.device
)

This might be a question for the spatial parallelism PR: do we know how often we call gather_global? If allocating 2x the global tensor becomes a memory issue, we may want to consider passing a pre-allocated buffer to the gather call and using it in place, so that we don't hold both the temporary gathered and gathered_global at the same time. I don't know whether this is an issue, but it is something we might want to consider in the future; a sketch of the buffer-passing idea follows.
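A minimal sketch of that idea, assuming contributions are gathered along the first dimension and that torch.distributed is initialized; gather_global_into and the buffer layout are illustrative assumptions, not this codebase's API.

import torch
import torch.distributed as dist


def gather_global_into(buffer: torch.Tensor, local: torch.Tensor) -> torch.Tensor:
    # all_gather_into_tensor writes every rank's contribution directly into the
    # caller-provided buffer, so no second global-sized temporary is allocated.
    # Assumes buffer.shape[0] == world_size * local.shape[0].
    dist.all_gather_into_tensor(buffer, local.contiguous())
    return buffer


# Usage sketch: allocate the buffer once and reuse it across gather calls.
# buffer = torch.empty(global_shape, dtype=local.dtype, device=local.device)
# gathered_global = gather_global_into(buffer, local)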

pytest --durations 40 .

test_parallel:
	torchrun --nproc-per-node 2 -m pytest ./fme/core/distributed/parallel_tests

[nit] make the number of processes a Make variable so you can use a different number when calling make test_parallel; one possible shape is sketched below.
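A possible sketch, where NPROC is a hypothetical variable name (override with make test_parallel NPROC=4):

NPROC ?= 2

test_parallel:
	torchrun --nproc-per-node $(NPROC) -m pytest ./fme/core/distributed/parallel_tests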
