# [WIP] HF State Dict SmolVLM #34
Open: tomiock wants to merge 19 commits into `main` from `smolvlm_hf_state_dict`
## Conversation
## Benchmarking

| Step | Time | Log |
| --- | --- | --- |
| `to_hf()` | 0.1103 s | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed to_hf conversion, generated 189 keys, duration: 0.1103s |
| Split local GroupedExperts DTensor into individual experts' weights | 0.008 s per layer per matrix (58 MoE layers * 3 weight matrices per layer) | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed _get_local_experts_weights for layer 6, abstract_key: model.layers.{}.mlp.experts.{}.up_proj.weight, duration: 0.0082s |
| `dcp.load()` with thread count = 4 | 193.20 s | [trainer0\|0]:[titan] 2025-10-03 17:10:58,899 - root - INFO - dcp.load with HuggingFaceStorageReader completed in 193.20 seconds |
| `from_hf()` | 0.48 s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,378 - root - INFO - Completed from_hf conversion, processed 189 keys, duration: 0.4787s |
| Concatenate individual experts' weights into GroupedExperts weight | 0.01 s per layer per matrix (58 MoE layers * 3 weight matrices) | [trainer0\|0]:[titan] 2025-10-03 17:10:59,120 - root - INFO - Completed _concatenate_expert_weights_dtensor for layer 5, abstract_key: layers.{}.moe.experts.w2, duration: 0.0142s |
| Total | 193.87 s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,458 - root - INFO - Finished loading the checkpoint in 193.87 seconds. |
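For context, a minimal sketch of the load path being benchmarked, assuming the `HuggingFaceStorageReader` shipped with recent `torch.distributed.checkpoint` (dcp) builds; `to_hf`/`from_hf` are placeholders for this PR's actual state-dict adapter functions, and the checkpoint path is illustrative:

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import HuggingFaceStorageReader


def load_hf_checkpoint(model: torch.nn.Module, hf_checkpoint_dir: str) -> None:
    """Sketch of the benchmarked path: to_hf() -> dcp.load() -> from_hf().

    `to_hf` / `from_hf` stand in for the adapter functions that rename and
    reshape keys between the titan and HuggingFace layouts.
    """
    # to_hf(): map titan keys/layouts to HF keys (placeholder for the adapter).
    hf_state_dict = to_hf(model.state_dict())

    # dcp.load() with HuggingFaceStorageReader reads the HF safetensors shards
    # directly into the (possibly DTensor-sharded) state dict.
    dcp.load(hf_state_dict, storage_reader=HuggingFaceStorageReader(hf_checkpoint_dir))

    # from_hf(): map back to titan keys and load into the model (placeholder).
    model.load_state_dict(from_hf(hf_state_dict))
```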
## End-to-End verification for 671B model
Parallelism: FSDP=32, PP=8, 1F1B, EP=32
<img width="393" height="421" alt="Screenshot 2025-10-06 at 8 32 37 PM"
src="https://github.com/user-attachments/assets/6d8dab00-a188-4c57-8348-02bae1d21d03"
/>
<img width="393" height="421" alt="Screenshot 2025-10-06 at 8 32 54 PM"
src="https://github.com/user-attachments/assets/a730f71b-3dc8-45e0-8d3e-b21080884f8d"
/>
With max-autotune, FlexAttention is not deterministic even if `torch.use_deterministic_algorithms(True)` is set. When deterministic mode is enabled, we should therefore also avoid `max-autotune`.
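A minimal sketch of how deterministic mode and the compile mode might be coupled; the `deterministic` flag below is a stand-in for the training config option, not torchtitan's actual field name:

```python
import torch

# Stand-in for a config flag (assumption, not torchtitan's real option name).
deterministic = True

if deterministic:
    # Deterministic run: enable deterministic algorithms and skip autotuning,
    # since max-autotune may select non-deterministic kernels.
    torch.use_deterministic_algorithms(True)
    compile_mode = "default"
else:
    compile_mode = "max-autotune"

# Compile the attention-heavy function (or module) with the chosen mode.
attention_fn = torch.compile(
    torch.nn.functional.scaled_dot_product_attention, mode=compile_mode
)
```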
Summary: allow users to specify the profiler schedule.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1809).
* #1811
* #1810
* #1812
* __->__ #1809

Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>
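For reference, a minimal sketch of a user-specified schedule with the standard `torch.profiler` API; the wait/warmup/active values are illustrative, not the PR's defaults:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# Illustrative schedule values; in the PR these come from user configuration.
prof_schedule = schedule(wait=1, warmup=2, active=3, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profile_traces"),
) as prof:
    for step in range(10):
        # ... run one training step here ...
        prof.step()  # advance the profiler schedule each iteration
```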
This PR is a follow-up of the SimpleFSDP+EP [PR](pytorch/torchtitan#1529). Here, we add a `gradient_divide_factor` following FSDP2 to ensure modules wrapped by (FSDP+EP) have the correct gradient reduction value.

- The original FSDP2 implementation is in this [PR](pytorch/torchtitan#1551).
- The `gradient_divide_factor` logic is [here](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py#L688).

We have two ways of handling `gradient_divide_factor` in `reduce_scatter`:

1. The first is to use `ReduceOp.PREMUL_SUM` to handle the `gradient_divide_factor`. However, DTensor's `_reduce_shard_value` only accepts `reduce_op` as a str input ([here](https://github.com/pytorch/pytorch/blob/8f705d019a64b1ca882e043b3eb98559273a9e59/torch/distributed/tensor/placement_types.py#L177-L210)). To make `_reduce_shard_value` work correctly with `ReduceOp.PREMUL_SUM`, we would need to update DTensor's `_reduce_shard_tensor` and `torch.distributed._functional_collectives.reduce_scatter_tensor` so that they can pass the factor associated with `ReduceOp.PREMUL_SUM` as an input.
2. Another way is to simulate `ReduceOp.PREMUL_SUM` with `ReduceOp.SUM`. The logic is in this [Diff](https://www.internalfb.com/diff/D76546536): it does a `div_` over the gradient before performing `ReduceOp.SUM`.

Currently I'm following option 2 since it requires fewer changes to `_functional_collectives` (a minimal sketch of this pattern is shown below). After enabling `reduction_divide_factor`, we see that FSDP(=2) + EP(=4) have identical loss:

<img width="1194" height="780" alt="Screenshot 2025-10-08 at 5 27 24 PM" src="https://github.com/user-attachments/assets/aaf83109-8db8-4051-973d-c7b6950513de" />
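A minimal sketch of the option-2 pattern, simulating `ReduceOp.PREMUL_SUM` by pre-dividing before a `ReduceOp.SUM` reduce-scatter; the function name, shapes, and factor are illustrative, not the PR's actual code path:

```python
import torch
import torch.distributed as dist


def premul_sum_via_sum_reduce_scatter(
    grad: torch.Tensor,
    gradient_divide_factor: float,
    group: dist.ProcessGroup,
) -> torch.Tensor:
    """Simulate ReduceOp.PREMUL_SUM: divide locally, then SUM-reduce-scatter."""
    world_size = dist.get_world_size(group)
    # Pre-scale the local gradient; summing the scaled shards is equivalent to
    # a PREMUL_SUM with factor 1 / gradient_divide_factor.
    grad = grad.div_(gradient_divide_factor)
    # Assumes the leading dim is divisible by the group size.
    output = torch.empty(
        grad.shape[0] // world_size, *grad.shape[1:], device=grad.device, dtype=grad.dtype
    )
    dist.reduce_scatter_tensor(output, grad, op=dist.ReduceOp.SUM, group=group)
    return output
```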
Llama 3.1 models use scaled RoPE by default; Llama 4 17B x 16E uses scaled RoPE while 17B x 128E does not. A sketch of the scaling rule is given after this list.

1. Verified forward parity between Titan Llama 3.1 8B and HuggingFace Llama 3.1 8B. The KL divergence of outputs from the same sample inputs is small. For comparison, before adding scaled RoPE support, the forward parity check on the Llama 3.1 8B model incurred a slightly larger KL divergence on sample inputs.
2. Verified training of Llama 3.1 8B with tensor parallel degree = 4.
3. Verified training of the Llama 4 debug model with scaled RoPE.
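For reference, a minimal sketch of Llama 3.1-style frequency scaling; the constants (`scale_factor=8`, `low_freq_factor=1`, `high_freq_factor=4`, `original_max_seq_len=8192`) are the commonly published Llama 3.1 values, stated here as assumptions rather than read from this PR:

```python
import math

import torch


def apply_scaled_rope(
    freqs: torch.Tensor,
    scale_factor: float = 8.0,
    low_freq_factor: float = 1.0,
    high_freq_factor: float = 4.0,
    original_max_seq_len: int = 8192,
) -> torch.Tensor:
    """Scale RoPE inverse frequencies, Llama 3.1 style (illustrative sketch)."""
    low_freq_wavelen = original_max_seq_len / low_freq_factor
    high_freq_wavelen = original_max_seq_len / high_freq_factor
    new_freqs = []
    for freq in freqs.tolist():
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # High-frequency band: keep as-is.
            new_freqs.append(freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: shrink by the scale factor.
            new_freqs.append(freq / scale_factor)
        else:
            # Transition band: interpolate smoothly between the two regimes.
            smooth = (original_max_seq_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.tensor(new_freqs, dtype=freqs.dtype, device=freqs.device)
```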
Forge's doc build is failing with some formatting issues that seem to come from the torchtitan docstrings:

```
docstring of torchtitan.config.job_config.Parallelism.fsdp_reshard_after_forward:7: ERROR: Unexpected indentation.
docstring of torchtitan.config.job_config.Parallelism.fsdp_reshard_after_forward:8: WARNING: Block quote ends without a blank line; unexpected unindent.
docstring of torchtitan.config.job_config.Parallelism.expert_parallel_degree:4: ERROR: Unexpected indentation.
docstring of torchtitan.config.job_config.Parallelism.expert_parallel_degree:7: WARNING: Block quote ends without a blank line; unexpected unindent.
docstring of torchtitan.config.job_config.Parallelism.expert_parallel_degree:11: WARNING: Bullet list ends without a blank line; unexpected unindent.
docstring of torchtitan.config.job_config.Checkpoint.async_mode:5: ERROR: Unexpected indentation.
```

Failing [job](https://github.com/meta-pytorch/forge/actions/runs/18360538773/job/52303073438?pr=336#step:11:73). This PR fixes those minor formatting issues.
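The errors above are typical of reStructuredText docstrings that start an indented block or bullet list without a preceding blank line. A hypothetical illustration of the kind of fix (not the actual torchtitan docstring text):

```python
# Before: Sphinx reports "Unexpected indentation" because the bullet list
# starts immediately after the introductory sentence, with extra indentation.
bad_docstring = """Degree of expert parallelism. Supported values:
    - 1: disabled
    - >1: number of EP ranks
"""

# After: a blank line before the list and consistent indentation keep the
# RST parser happy.
good_docstring = """Degree of expert parallelism. Supported values:

- 1: disabled
- >1: number of EP ranks
"""
```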
Fix the number-of-layers issue introduced by #1804.
Inspired by the blogpost: https://pytorch.org/blog/activation-checkpointing-techniques/
In VLM interleaved training with native resolution and aspect ratio, the number of tokens participating in the loss computation differs per rank. Naive FSDP gradient averaging across data ranks can cause tokens on ranks with fewer valid tokens to contribute more to the loss than tokens on other ranks.
This PR addresses this via loss balancing, which incurs an additional communication in the loss computation.
In practice, I haven't noticed any impact from this comm.
#### Quick sanity check
On each rank $i$ with $N_i$ tokens, let the sum loss over all tokens be $L_i = \sum_{j=1}^{N_i}\ell_{ij}$, with gradient $g_i = \sum_{j=1}^{N_i}\nabla\ell_{ij}$.
If we multiply the *loss* on each rank by a constant factor **c** (the
same for all ranks), then after `backward()`:
$$
\tilde g_i = c \cdot g_i .
$$
FSDP will *average* these gradients across ranks:
$$
g_{\text{FSDP}}=\frac{1}{R}\sum_{i=1}^{R} \tilde g_i
=\frac{c}{R}\sum_{i=1}^{R} g_i .
$$
We want this to equal the **global‑sample average**:
$$
g_{\text{true}}
=\frac{1}{N_{\text{total}}}\sum_{i=1}^{R}\sum_{j=1}^{N_i}\nabla
\ell_{ij}
=\frac{1}{N_{\text{total}}}\sum_{i=1}^{R} g_i .
$$
Thus for FSDP gradient to be correct, we need
$$
\frac{c}{R}= \frac{1}{N_{\text{total}}}\quad\Longrightarrow\quad
c=\frac{R}{N_{\text{total}}}.
$$
So the *right* scaling factor is $R/N_{\text{total}}$, which means dividing the per-rank sum loss by $N_{\text{total}}/R$, the **average number of tokens per rank**.
Intuitively, this is the same as the default cross-entropy loss, but instead of dividing the sum loss on a rank by the number of tokens **on that rank**, we now divide by the **average number of tokens across all ranks** (see the sketch below).
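A minimal sketch of this balancing, assuming a per-rank sum loss and a process group over the data-parallel ranks; the function and variable names are illustrative, not the PR's actual code:

```python
import torch
import torch.distributed as dist


def balanced_cross_entropy(
    logits: torch.Tensor,         # (num_tokens, vocab_size) on this rank
    targets: torch.Tensor,        # (num_tokens,) on this rank
    dp_group: dist.ProcessGroup,  # data-parallel group of size R
) -> torch.Tensor:
    # Per-rank sum loss L_i over the tokens on this rank
    # (ignore_index handling omitted for brevity).
    loss_sum = torch.nn.functional.cross_entropy(logits, targets, reduction="sum")

    # One extra all_reduce to get N_total, the number of tokens across all ranks.
    num_tokens = torch.tensor(
        [targets.numel()], device=logits.device, dtype=torch.float32
    )
    dist.all_reduce(num_tokens, op=dist.ReduceOp.SUM, group=dp_group)

    # Divide by the average tokens per rank, N_total / R, so that FSDP's
    # subsequent 1/R gradient averaging yields the global per-token average.
    world_size = dist.get_world_size(dp_group)
    avg_tokens_per_rank = num_tokens / world_size
    return loss_sum / avg_tokens_per_rank.squeeze()
```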
P.S.: Sorry, this PR is based on #1802 but I couldn't choose that as the base branch. It may be easier to review once that PR is merged.
…#1849)

A test run on the vlm debug model:
<img width="1109" height="491" alt="image" src="https://github.com/user-attachments/assets/5763aa67-946e-4ab3-9ce8-e884fa3d1776" />
…#1776)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* #1797
* __->__ #1776

**Status**
1. Change all models, including the experimental ones.
2. E2E loss verification.
3. We should add a unit test for attention. But since we don't have GPU unit tests, this can be done in a separate PR.

**Summary**
This PR aims to refactor how TorchTitan builds the attention masks and passes them to the model. Before this PR, `init_attention_masks()` is called in Trainer but the masks are stored as a class variable of `FlexAttentionWrapper()`. We chose this shortcut to support the case where a single model requires multiple masks.

The previous design has several issues; one in particular is pytorch/torchtitan#1723. pytorch/pytorch#164111 proves that we can let PP split `BlockMask`, so this PR performs the refactor to pass masks as an argument of `model.forward()`.

The new design (a sketch of the interface follows the verification commands below):
1. The model needs to provide `get_attention_masks()` that accepts `create_mask_fn`, `batch`, and `eos_id`. If the attention op is SDPA, this API should return None, as SDPA currently doesn't support varlen; but once it does, we may have to return some tuple of ints that represents the mask. Justification: attention logic is technically part of the model but requires some information from the trainer/dataloader, so it's the model author's responsibility to provide an API the trainer can call to get the masks.
2. `get_attention_masks()` will be called from the trainer and the resulting masks are passed to `model.forward()`. Justification: this allows us to fix pytorch/torchtitan#1723 with pytorch/pytorch#164111 and this PR.
3. SDPA and FlexAttention are now wrapped in two different classes.

~~Note: we still have two very thin op wrappers that are used for CP. I keep these two for CP education purposes, but this can certainly be confusing for Titan's users. I'm open to merging them into AttentionOp.~~ See the discussion in pytorch/torchtitan#1723.

**Verification**

*llama3*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint --config="./torchtitan/models/llama3/train_configs/debug_model.toml"
```

*llama3 flex*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint --config="./torchtitan/models/llama3/train_configs/debug_model.toml" --baseline-train-options="--model.flavor=debugmodel_flex_attn"
```

*llama4*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint
```

*llama4 irope*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint
```

*deepseek*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint --config="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml"
```

*deepseek flex*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint --config="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" --baseline-train-options="--model.flavor=debugmodel_flex_attn"
```
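A minimal sketch of the described interface, assuming FlexAttention's `create_block_mask`/`BlockMask` from `torch.nn.attention.flex_attention`; the exact signatures in torchtitan may differ, so treat the names below as illustrative:

```python
from typing import Callable, Optional

import torch
from torch.nn.attention.flex_attention import BlockMask, create_block_mask


def get_attention_masks(
    batch: torch.Tensor,  # token ids, shape (B, S)
    eos_id: int,
    create_mask_fn: Callable[..., BlockMask] = create_block_mask,
) -> Optional[BlockMask]:
    """Build a document-causal BlockMask from eos positions (illustrative sketch)."""
    B, S = batch.shape
    # Document ids increase after each eos token, so tokens only attend
    # within their own document (and causally).
    doc_ids = (batch == eos_id).cumsum(dim=-1)

    def mask_mod(b, h, q_idx, kv_idx):
        causal = q_idx >= kv_idx
        same_doc = doc_ids[b, q_idx] == doc_ids[b, kv_idx]
        return causal & same_doc

    return create_mask_fn(mask_mod, B, None, S, S)


# In the trainer loop, masks are built once per batch and passed to forward, e.g.:
# masks = model.get_attention_masks(create_mask_fn=create_block_mask,
#                                   batch=inputs, eos_id=tokenizer.eos_id)
# logits = model(inputs, attention_masks=masks)
```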
The state dict is loaded correctly (in theory).
But when generating with the loaded weights, the output is garbage:
