Pipeline parallelism continued #399
Draft: BlueCrescent wants to merge 47 commits into pipeline_parallelism from pipeline_parallelism_fix
47 commits
1d4943f feat: Make test for pipeline parallelism work (rrutmann)
5b53ff9 refactor(parallelism): Removed necessity of additional model and loss… (BlueCrescent)
5147a7a refactor(parallelism): Clean up for pp test. (BlueCrescent)
1cb9779 test: Print losses to debug tests (rrutmann)
27ad56d feat: Use scheduled_pipeline for forward backward pass (rrutmann)
41c4f36 feat: Use scheduled_pipeline for training (rrutmann)
6f3d5da feat: Use scheduled_pipe in evaluation (rrutmann)
9b85334 test: Print losses if test fails (rrutmann)
84e2702 chore: Run evaluation before training (rrutmann)
32fbe94 chore: Increase microbatch size (rrutmann)
61ab311 fix: Use dp size instead of world size for last batch aggregation (rrutmann)
6952bcc docs: Add TODOs for later check (rrutmann)
90dbe51 fix: Train before evaluation so that pp is initialized for backwards (rrutmann)
49df7d6 fix: Add missing parameter seed to GPT2LLMConfig (rrutmann)
7996a29 fix: Retrieve all PP ranks for gradient clipping (rrutmann)
cbddcbc test: Add new parameter num_data_parallel_ranks to Trainer (rrutmann)
56a917a fix: Make FSDP1GradientClipperConfig independent of device_mesh (rrutmann)
eb47aa9 fix: Handle optional device_mesh correctly (rrutmann)
d228351 feat: Consider pipeline parallelism in tensor parallelization (rrutmann)
55dad72 test: Use the same data on each rank & test tensor parallelism (rrutmann)
b6a1e2d refactor(parallelism): Some clean-up. (BlueCrescent)
16a51af chore: Merge branch 'pipeline_parallelism_fix' of github.com:Modaliti… (rrutmann)
c49895a test: Update configs for parallelization testing (rrutmann)
f685fc5 test: Use correct length to create test sequences (rrutmann)
c07fcf6 test: Use realistic std for model initialization (rrutmann)
5019bbb fix: Remove unused third dimension for reduced_losses (rrutmann)
a08e555 refactor: Remove unused filtering (rrutmann)
45b5418 fix: Aggregate loss of last train batch correctly across pp ranks (rrutmann)
a394ab0 docs: Add example config for pipeline and tensor parallelism (rrutmann)
cae050e docs: Add docstrings and type hints (rrutmann)
6952230 docs: Add type hints and docstrings (rrutmann)
ffa032c fix: Check if parallelism method is initialized (rrutmann)
8d418a1 docs: Add new parameter in docstring (rrutmann)
fffd0a1 test: Run only one PP only test (rrutmann)
049472f refactor: Addressed copilot review (rrutmann)
608c7fc chore: Remove requirements for python and torch (rrutmann)
16c4bc4 fix: Allow dp shard degree 1 (rrutmann)
f5a1020 test: Add test for checkpointing with pipeline parallelism (rrutmann)
9d1f107 fix(parallelism): Building model stages in PP now also filters the mo… (BlueCrescent)
dfc1bde test(checkpointing): Some fixes for pp checkpointing test. (BlueCrescent)
cd9f595 test(checkpointing): Made dcp checkpointing test terminate correctly … (BlueCrescent)
edf7a4e test(checkpointing): Checkpointing equality tests now explicitly only… (BlueCrescent)
abcf235 fix: Use ModuleDict for transformer layers for correct checkpointing …
554cd39 chore: Rename layer_id to layer_idx
484815e test: Adapt tests to new gpt2 model structure
ddb249b test: Adapt code to latest changes to pass tests (rrutmann)
51b7db4 test(data): Added tests for distributed multi dim data sampling. (BlueCrescent)
Should we reference this from the top?