
Conversation


@BlueCrescent BlueCrescent commented Sep 17, 2025

What does this PR do?

  • Issue: Dicts, as used in Modalities for model/loss inputs and outputs, are not supported by the torch PP code.
  • Issue: Loss accumulation (for logging) needs to consider the PP world size instead of the total world size
    • Solution:
      • For the averaged loss we only increment the batch count on the last stages of a PP schedule.
      • For the current batch loss we added the pp world size, computed from the device mesh, to the Trainer class.
  • Issue: Due to a bug in PyTorch, we cannot run evaluation before training begins.
  • Issue: Standard distributed dataloader/sampler usage would lose batches
    • Solution: Use (Resumable)DistributedMultiDimSamplerConfig
    • TODO: Implement DistributedMultiDimSamplerConfig for eval dataloader
    • TODO??: Remove data_parallel_key param from SamplerFactory.create_resumable_distributed_multi_dim_sampler()
      • Instead use BOTH ParallelismDegrees.DP_REPLICATE and ParallelismDegrees.DP_SHARD (multiplied)
  • Issue: Gradient clipping needs to sync over all stages
  • Issue: Instead of calling forward() and backward() on the model and executing the loss_fct(), PP requires calling step() on the pipeline schedule (see the sketch after this list).
    • Solution: Integrated scheduled pipeline into Trainer and Evaluator
  • Issue: Want to run evaluation with torch.no_grad() and without the backward pass performed by the pipeline schedule's step().
    • Solution: This is only supported from PyTorch 2.9 (currently nightly) onwards, where the pipeline schedule gains an eval() method to be used instead of step().
    • ⚠️Warning⚠️: Consequently, we can only use PP for training with PyTorch 2.9 nightly installed.
  • Note: Weight tying is currently not supported and probably not possible with PP.
  • Question: Do we need to consider something regarding model init so that all corresponding model copies are initialized the same?
    • I.e.: How can we ensure that parallel stages are initialized the same?
    • Other way around: Will stages containing similar model parts be initialized the same?
    • Answer: Since we first build the PP stages, then apply the FSDP2 parallelization, and only then initialize, this is fine.
    • Follow up question: What happens, if we use full replica data parallelization?
  • Issue: Want to compare non-PP forward pass with PP forward pass in unit test
    • Goal: Compare losses
    • Solution: Added a non-PP config used to compute a comparison loss
    • Issue: Need non-PP forward pass loss on all ranks that contain a last stage of the PP forward pass.
      • Solution: Run a normal FSDP2 forward pass on all ranks and ignore the output on non-last stage ranks.
      • Remark: Running the FSDP2 pass only on the last stage ranks led to hanging. This might be a known issue with FSDP2.
    • Issue: Need the model to effectively compute the same forward pass.
      • Solution:
        • In the test config, run model initialization before sharding/staging (don't do this in a real training run!).
        • Fix torch seed before each initialization.
        • Fix torch seed before generating each input sequence. Note that we thus have identical data parallel batches.
    • TODO: Numerical instability can still be observed and needs to be investigated further (also in FSDP-only runs).
  • TP + PP
    • Issue: What is the correct sharding/staging order?
      • Solution: 1. PP 2. TP 3. FSDP
    • Issue: TP initialization modifies the layers of the model and runs into problems if those have been deleted for PP.
      • Solution: Adapted TP code to be able to handle missing/None layers.
    • Issue: The PP + TP test hangs on the first stages' first microbatches at the start of attention.
      • Solution: The test used sequence length 255. It seems the sequence length not being divisible by two caused issues with TP's sequence parallelism.
      • ⚠️Warning⚠️: This needs to be kept in mind when using PP + TP.
  • TODO: Wherever "dist.get_world_size()" is used, check whether we should use the number of data parallel ranks instead.
  • TODO: MFU is reported as 0 on dgx2, but this also happens without PP
  • TODO: Integrate and test other pipeline schedules.
  • TODO: Check if checkpointing still works.
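
A minimal sketch of what the schedule-based train step looks like, loosely following torch.distributed.pipelining (illustrative only; the flags is_first_stage/is_last_stage and the way inputs and targets are passed are assumptions, not the exact Trainer code):

import torch
from torch.distributed.pipelining import ScheduleGPipe

def pp_train_step(
    schedule: ScheduleGPipe,
    inputs: torch.Tensor,
    targets: torch.Tensor,
    is_first_stage: bool,
    is_last_stage: bool,
) -> torch.Tensor | None:
    losses: list[torch.Tensor] = []
    if is_first_stage:
        schedule.step(inputs)  # feed the microbatches into the first stage
    elif is_last_stage:
        schedule.step(target=targets, losses=losses)  # the loss is computed only here
    else:
        schedule.step()  # intermediate stages only relay activations
    # Only last-stage ranks hold a loss, which is why the batch count for the
    # averaged loss is incremented on last stages and the current batch loss is
    # scaled using the PP world size taken from the device mesh.
    return torch.stack(losses).mean() if is_last_stage else None

From PyTorch 2.9 (nightly) onwards, evaluation uses the schedule's eval() method in the same spot instead of step(), as noted above.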

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)


@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements pipeline parallelism support in the Modalities framework by addressing PyTorch's lack of native support for dict-based model inputs/outputs and implementing proper loss accumulation, gradient clipping, and data loading for pipeline parallel training.

Key changes include:

  • Overloaded model forward() methods to support both dict and tensor inputs for pipeline parallelism compatibility (see the sketch after this list)
  • Updated gradient clippers to sync across pipeline stages and support device mesh configurations
  • Modified trainer and evaluator to integrate with pipeline schedules instead of direct model calls
  • Added proper loss accumulation using data parallel world size instead of total world size
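
As an illustration of the dict/tensor duality (a sketch under assumed names, not the PR's actual GPT-2 code; the "input_ids" key and the TinyLM class are made up for the example), a forward() serving both the regular dict-based interface and the tensor-only convention of pipeline stages could look like this:

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Stand-in for the real GPT-2 model in Modalities (sketch only)."""

    def __init__(self, vocab_size: int = 128, hidden_dim: int = 32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs: dict[str, torch.Tensor] | torch.Tensor) -> dict[str, torch.Tensor] | torch.Tensor:
        if isinstance(inputs, torch.Tensor):
            # PP path: pipeline stages exchange plain tensors
            return self.lm_head(self.embedding(inputs))
        # regular path: dict in, dict out
        return {"logits": self.lm_head(self.embedding(inputs["input_ids"]))}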

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

File | Description
tests/fsdp2_parallelization/pipeline_parallelism/test_pp_fwd_bwd_pass.py | Adds comprehensive test comparing PP vs non-PP forward passes with loss validation
tests/fsdp2_parallelization/pipeline_parallelism/configs/config_lorem_ipsum_long_fsdp2_fwd_bwd_pass.yaml | New FSDP2 config for PP testing without pipeline stages
src/modalities/models/gpt2/gpt2_model.py | Overloads forward() method to accept both dict and tensor inputs for PP compatibility
src/modalities/training/gradient_clipping/fsdp_gradient_clipper.py | Implements cross-stage gradient norm synchronization for pipeline parallelism
src/modalities/trainer.py | Integrates pipeline schedule execution and fixes loss accumulation for PP
src/modalities/loss_functions.py | Overloads loss function to handle both InferenceResultBatch and tensor inputs
src/modalities/evaluator.py | Adds pipeline schedule support to evaluation process
src/modalities/models/model_factory.py | Fixes tensor parallelism to handle missing model layers in PP stages


def __call__(self, outputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
...

def __call__(self, *args, **kwargs) -> torch.Tensor:

Copilot AI Sep 17, 2025


Using *args and **kwargs instead of proper overloads makes the API less type-safe and harder to understand. Consider implementing proper method overloading with specific parameter types.
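
One way to keep the flexible runtime dispatch while giving type checkers precise signatures is typing.overload; a rough sketch, not the code in this PR (the import path of InferenceResultBatch is assumed):

from typing import overload

import torch

from modalities.batch import InferenceResultBatch  # import path assumed

class CLMCrossEntropyLoss:
    @overload
    def __call__(self, forward_batch: InferenceResultBatch) -> torch.Tensor: ...

    @overload
    def __call__(self, outputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor: ...

    def __call__(self, *args, **kwargs) -> torch.Tensor:
        # single runtime implementation (e.g. the existing _parse_arguments dispatch);
        # the @overload stubs above only serve type checkers
        ...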


@rrutmann rrutmann self-assigned this Sep 22, 2025
@rrutmann rrutmann requested a review from le1nux September 22, 2025 16:02
@le1nux le1nux added this to the 100B milestone Oct 8, 2025
@le1nux le1nux linked an issue Oct 8, 2025 that may be closed by this pull request

@le1nux le1nux left a comment


Awesome work with the PP integration! Functionality-wise everything looked correct (did not check the tests yet).

Regarding the integration from an architectural perspective, I left a couple of comments. I think we should do some refactoring here.

gradient_clipper=components.gradient_clipper,
global_num_tokens_per_train_step=global_num_tokens_per_train_step,
mfu_calculator=components.mfu_calculator,
num_pipeline_parallel_ranks=num_pipeline_parallel_ranks,


I would prefer if we kept the Trainer high-level and abstracted away specifics like PP.

checkpointing_interval_in_steps=components.settings.intervals.checkpointing_interval_in_steps,
evaluation_interval_in_steps=components.settings.intervals.evaluation_interval_in_steps,
training_log_interval_in_steps=components.settings.intervals.training_log_interval_in_steps,
scheduled_pipeline=components.scheduled_pipeline if components.scheduled_pipeline else None,


Same point as for the trainer. Could we wrap the scheduled pipeline instead and use the existing model interfaces?

Comment on lines +189 to +198
pp_mesh = get_mesh_for_parallelism_method(device_mesh=device_mesh, parallelism_method=ParallelismDegrees.PP)
if pp_mesh is not None:
if math.isinf(norm_type):
dist.all_reduce(total_norm, op=dist.ReduceOp.MAX, group=pp_mesh.get_group())
else:
total_norm **= norm_type
dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=pp_mesh.get_group())
total_norm **= 1.0 / norm_type

torch.nn.utils.clip_grads_with_norm_(parameters, max_norm, total_norm, foreach)


we should have a test for this
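
One cheap, single-process sanity check that could back this (hypothetical test, not part of the PR): combining per-stage gradient norms via the sum of p-th powers followed by the 1/p root must reproduce the norm over all gradients, which is exactly what the all_reduce(SUM) path relies on.

import torch

def test_cross_stage_norm_combination():
    norm_type = 2.0
    # gradients as they would live on three different PP stages
    stage_grads = [torch.randn(10), torch.randn(7), torch.randn(3)]
    per_stage_norms = torch.stack([g.norm(norm_type) for g in stage_grads])
    combined = (per_stage_norms**norm_type).sum() ** (1.0 / norm_type)
    expected = torch.cat(stage_grads).norm(norm_type)
    torch.testing.assert_close(combined, expected)

A multi-rank test exercising the actual dist.all_reduce over the PP group would of course be the stronger check.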

Comment on lines +240 to +249

pp_mesh = get_mesh_for_parallelism_method(
device_mesh=self.device_mesh, parallelism_method=ParallelismDegrees.PP
)
if pp_mesh is not None:
if math.isinf(self.norm_type.value):
dist.all_reduce(total_norm, op=dist.ReduceOp.MAX, group=pp_mesh.get_group())
else:
total_norm **= self.norm_type.value
dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=pp_mesh.get_group())


duplicated code

pp_schedule_name: gpipe
batch_size: ${settings.step_profile.local_train_micro_batch_size}
microbatch_size: 1
microbatch_size: 2


should we reference this from the top?

@@ -1,12 +1,10 @@
[project]
name = "modalities"
version = "0.3.2"


Why did we remove this? Our testing is always against 3.10 and 3.11. Do we need a more recent Python version?

...

def __call__(self, *args, **kwargs) -> torch.Tensor:
labels, lm_logits = self._parse_arguments(args, kwargs)


Could be improved from a software engineering point of view

Comment on lines +54 to +88
def _parse_arguments(
self,
args: list[torch.Tensor] | list[InferenceResultBatch],
kwargs: dict[str, torch.Tensor] | dict[str, InferenceResultBatch],
) -> tuple[torch.Tensor, torch.Tensor]:
if len(args) == 1 and isinstance(args[0], InferenceResultBatch):
forward_batch = args[0]
labels = forward_batch.get_targets(self.target_key)
lm_logits = forward_batch.get_predictions(self.prediction_key)
elif "forward_batch" in kwargs and isinstance(kwargs["forward_batch"], InferenceResultBatch):
forward_batch = kwargs["forward_batch"]
labels = forward_batch.get_targets(self.target_key)
lm_logits = forward_batch.get_predictions(self.prediction_key)
elif len(args) == 2 and all(isinstance(arg, torch.Tensor) for arg in args):
lm_logits, labels = args
elif (
"outputs" in kwargs
and "targets" in kwargs
and isinstance(kwargs["outputs"], torch.Tensor)
and isinstance(kwargs["targets"], torch.Tensor)
):
lm_logits = kwargs["outputs"]
labels = kwargs["targets"]
elif (
len(args) == 1
and "targets" in kwargs
and isinstance(args[0], torch.Tensor)
and isinstance(kwargs["targets"], torch.Tensor)
):
lm_logits = args[0]
labels = kwargs["targets"]
else:
raise TypeError("Invalid arguments for CLMCrossEntropyLoss.__call__")
return labels, lm_logits



Idea: What about defining a new component "pp-loss", which takes a normal loss function and handles the PP-specific part?

Generally, I think this parsing function could be improved.
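
A rough sketch of that idea (hypothetical code, not part of this PR; the InferenceResultBatch construction is schematic and its real constructor may differ): a thin adapter receives the bare tensors from the pipeline schedule and hands a regular InferenceResultBatch to the wrapped loss, so the wrapped loss keeps its single-signature interface.

import torch

from modalities.batch import InferenceResultBatch  # import path assumed

class PPLossAdapter:
    """Hypothetical "pp-loss" component that wraps an ordinary Modalities loss."""

    def __init__(self, loss_fn, prediction_key: str, target_key: str):
        self._loss_fn = loss_fn
        self._prediction_key = prediction_key
        self._target_key = target_key

    def __call__(self, outputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # The pipeline schedule calls the loss with plain tensors; rebuild the
        # batch structure the wrapped loss expects (constructor args assumed).
        batch = InferenceResultBatch(
            predictions={self._prediction_key: outputs},
            targets={self._target_key: targets},
        )
        return self._loss_fn(batch)

With such a wrapper, the tensor-handling branches in _parse_arguments could likely be dropped.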

device_mesh: PydanticDeviceMeshIFType | None = None


class DummyGradientClipperConfig(BaseModel):


can we remove this class now?

with torch.no_grad():
result_batch = model_predict_batch(model=model, batch=batch)
loss = loss_fun(result_batch)
if scheduled_pipeline is not None:


Basically code duplication from the trainer.
Also, I'm not a big fan of passing the scheduled_pipeline in here.



Development

Successfully merging this pull request may close these issues.

Epic: Pipeline Parallelism
