Draft

Commits (47)
1d4943f
feat: Make test for pipeline parallelism work
rrutmann Sep 5, 2025
5b53ff9
refactor(parallelism): Removed necessity of additional model and loss…
BlueCrescent Sep 8, 2025
5147a7a
refactor(parallelism): Clean up for pp test.
BlueCrescent Sep 8, 2025
1cb9779
test: Print losses to debug tests
rrutmann Sep 8, 2025
27ad56d
feat: Use scheduled_pipeline for forward-backward pass
rrutmann Sep 9, 2025
41c4f36
feat: Use scheduled_pipeline for training
rrutmann Sep 9, 2025
6f3d5da
feat: Use scheduled_pipe in evaluation
rrutmann Sep 9, 2025
9b85334
test: Print losses if test fails
rrutmann Sep 9, 2025
84e2702
chore: Run evaluation before training
rrutmann Sep 9, 2025
32fbe94
chore: Increase microbatch size
rrutmann Sep 9, 2025
61ab311
fix: Use dp size instead of world size for last batch aggregation
rrutmann Sep 10, 2025
6952bcc
docs: Add TODOs for later check
rrutmann Sep 10, 2025
90dbe51
fix: Train before evaluation so that pp is initialized for backwards
rrutmann Sep 10, 2025
49df7d6
fix: Add missing parameter seed to GPT2LLMConfig
rrutmann Sep 12, 2025
7996a29
fix: Retrieve all PP ranks for gradient clipping
rrutmann Sep 15, 2025
cbddcbc
test: Add new parameter num_data_parallel_ranks to Trainer
rrutmann Sep 15, 2025
56a917a
fix: Make FSDP1GradientClipperConfig independent of device_mesh
rrutmann Sep 15, 2025
eb47aa9
fix: Handle optional device_mesh correctly
rrutmann Sep 15, 2025
d228351
feat: Consider pipeline parallelism in tensor parallelization
rrutmann Sep 17, 2025
55dad72
test: Use the same data on each rank & test tensor parallelism
rrutmann Sep 17, 2025
b6a1e2d
refactor(parallelism): Some clean-up.
BlueCrescent Sep 17, 2025
16a51af
chore: Merge branch 'pipeline_parallelism_fix' of github.com:Modaliti…
rrutmann Sep 18, 2025
c49895a
test: Update configs for parallelization testing
rrutmann Sep 19, 2025
f685fc5
test: Use correct length to create test sequences
rrutmann Sep 19, 2025
c07fcf6
test: Use realistic std for model initialization
rrutmann Sep 19, 2025
5019bbb
fix: Remove unused third dimension for reduced_losses
rrutmann Sep 19, 2025
a08e555
refactor: Remove unused filtering
rrutmann Sep 19, 2025
45b5418
fix: Aggregate loss of last train batch correctly across pp ranks
rrutmann Sep 22, 2025
a394ab0
docs: Add example config for pipeline and tensor parallelism
rrutmann Sep 22, 2025
cae050e
docs: Add docstrings and type hints
rrutmann Sep 22, 2025
6952230
docs: Add type hints and docstrings
rrutmann Sep 22, 2025
ffa032c
fix: Check if parallelism method is initialized
rrutmann Sep 22, 2025
8d418a1
docs: Add new parameter in docstring
rrutmann Sep 22, 2025
fffd0a1
test: Run only one PP only test
rrutmann Sep 23, 2025
049472f
refactor: Addressed copilot review
rrutmann Sep 24, 2025
608c7fc
chore: Remove requirements for python and torch
rrutmann Oct 15, 2025
16c4bc4
fix: Allow dp shard degree 1
rrutmann Oct 17, 2025
f5a1020
test: Add test for checkpointing with pipeline parallelism
rrutmann Oct 17, 2025
9d1f107
fix(parallelism): Building model stages in PP now also filters the mo…
BlueCrescent Oct 17, 2025
dfc1bde
test(checkpointing): Some fixes for pp checkpointing test.
BlueCrescent Oct 17, 2025
cd9f595
test(checkpointing): Made dcp checkpointing test terminate correctly …
BlueCrescent Oct 20, 2025
edf7a4e
test(checkpointing): Checkpointing equality tests now explicitly only…
BlueCrescent Oct 21, 2025
abcf235
fix: Use ModuleDict for transformer layers for correct checkpointing …
Oct 21, 2025
554cd39
chore: Rename layer_id to layer_idx
Oct 21, 2025
484815e
test: Adapt tests to new gpt2 model structure
Oct 21, 2025
ddb249b
test: Adapt code to latest changes to pass tests
rrutmann Oct 21, 2025
51b7db4
test(data): Added tests for distributed multi dim data sampling.
BlueCrescent Oct 21, 2025
config_files/training/config_lorem_ipsum_long_fsdp2_pp.yaml (2 additions, 2 deletions)
@@ -24,7 +24,7 @@ settings:
     enforce_last_step_checkpointed: false
   step_profile:
     gradient_accumulation_steps: 1
-    local_train_micro_batch_size: 2
+    local_train_micro_batch_size: 4
     sequence_length: 256
   training_target:
     num_target_tokens:
@@ -222,7 +222,7 @@ scheduled_pipeline:
     pass_type: BY_REFERENCE
     pp_schedule_name: gpipe
     batch_size: ${settings.step_profile.local_train_micro_batch_size}
-    microbatch_size: 1
+    microbatch_size: 2
Review comment (Member):
should we reference this from the top?

     pp_degree: ${device_mesh.config.pipeline_parallel_degree}
     # maybe better to use the fsdp model and the schedule here
     # instead of passing in the staged pipeline?
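
On the review question above: batch_size is already pulled in from settings.step_profile via interpolation, so the microbatch size could be referenced from the top of the config in the same way. Below is a minimal sketch of how that might look, assuming the config loader resolves the new reference exactly like the existing ${settings.step_profile.local_train_micro_batch_size} one; the pipeline_microbatch_size key is hypothetical and not part of the original config.

settings:
  step_profile:
    gradient_accumulation_steps: 1
    local_train_micro_batch_size: 4
    # hypothetical key (not in the original config): keeps the pipeline
    # microbatch size next to the batch size it has to divide evenly
    pipeline_microbatch_size: 2
    sequence_length: 256

scheduled_pipeline:
  # remaining scheduled_pipeline keys omitted for brevity
  pass_type: BY_REFERENCE
  pp_schedule_name: gpipe
  batch_size: ${settings.step_profile.local_train_micro_batch_size}
  microbatch_size: ${settings.step_profile.pipeline_microbatch_size}
  pp_degree: ${device_mesh.config.pipeline_parallel_degree}

With the values from this diff, the gpipe schedule splits each local batch of 4 samples into 4 / 2 = 2 microbatches per pipeline step; whichever key ends up being referenced, microbatch_size still has to divide batch_size evenly.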