@pjelement

Summary:
This diff implements the TRANSFORMER warmup policy from "Attention is All You Need" (Vaswani et al., 2017) for learning rate scheduling in torchrec, and updates a model configuration to use it.

Implementation

Added WarmupPolicy.TRANSFORMER to fbcode/torchrec/optim/warmup.py, which implements the formula:

lr = base_lr * min(step^(-0.5), step * warm_steps^(-1.5)) * lr_scale

This schedule provides:

  • Warmup phase: the LR increases linearly from near zero to its peak at warm_steps
  • Decay phase: the LR decays as the inverse square root of the step after warm_steps

The max_iters parameter serves as warm_steps in the formula. The schedule peaks at step = warm_steps, where the two terms inside the min() become equal.
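
For reference, here is a minimal standalone sketch of the schedule as stated above (an illustrative reimplementation of the formula, not the actual torchrec code):

```python
def transformer_lr(step: int, base_lr: float, warm_steps: int, lr_scale: float = 1.0) -> float:
    """Transformer LR schedule: linear warmup up to `warm_steps`, then
    inverse-square-root decay. Steps are assumed to start at 1."""
    step = max(step, 1)  # guard against step 0, which would raise on 0 ** -0.5
    return base_lr * min(step ** -0.5, step * warm_steps ** -1.5) * lr_scale
```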

Testing

Added comprehensive unit tests in fbcode/torchrec/optim/tests/test_warmup.py:

  • Formula correctness at key milestones (step 1, warmup completion, post-warmup); see the sketch after this list
  • Monotonic increase during warmup phase
  • Monotonic decrease during decay phase
  • Proper application of lr_scale multiplier
  • Integration tests with WarmupOptimizer
  • Uses none_throws() from pyre_extensions for type-safe Optional handling
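
As an illustration of the milestone checks, a small self-contained sketch (the helper restates the formula above; this is not the actual code in test_warmup.py):

```python
import unittest


def transformer_lr(step: int, base_lr: float, warm_steps: int, lr_scale: float = 1.0) -> float:
    # Restates the formula from the Implementation section above.
    return base_lr * min(step ** -0.5, step * warm_steps ** -1.5) * lr_scale


class TransformerScheduleMilestonesTest(unittest.TestCase):
    def test_milestones(self) -> None:
        warm_steps = 4_000
        # Step 1: the warmup term dominates, so the LR starts near zero.
        self.assertAlmostEqual(transformer_lr(1, 1.0, warm_steps), warm_steps ** -1.5)
        # Warmup completion: both min() terms coincide; the LR peaks at warm_steps^-0.5.
        self.assertAlmostEqual(transformer_lr(warm_steps, 1.0, warm_steps), warm_steps ** -0.5)
        # Post-warmup: the inverse-square-root term takes over and the LR decays.
        self.assertLess(
            transformer_lr(2 * warm_steps, 1.0, warm_steps),
            transformer_lr(warm_steps, 1.0, warm_steps),
        )


if __name__ == "__main__":
    unittest.main()
```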

Updated fbcode/torchrec/optim/tests/BUCK to include the pyre-extensions dependency.

Configuration Update

Updated fbcode/minimal_viable_ai/models/gysj/gysj_esr_roo/conf/model_roo_config.py to use TRANSFORMER warmup:

  • Changed both sparse and dense optimizers from LINEAR warmup to TRANSFORMER
  • Set warm_steps to 80,000 for both optimizers
  • Updated the optimizer hyperparameters (lr=0.001, eps=1e-6, beta values) to work with the TRANSFORMER schedule; a quick check of the resulting LR trajectory follows this list
  • Changed the sparse optimizer from ROWWISE_ADAGRAD to ADAM for consistency with the dense optimizer
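
Taking the stated formula at face value with these settings, the LR trajectory can be sanity-checked as follows (illustrative only; the values actually seen in training also depend on lr_scale and on how the config wires base_lr into the warmup stage):

```python
# Illustrative sanity check: evaluates the stated formula with the configured
# base_lr=0.001 and warm_steps=80,000, assuming lr_scale=1.0.
base_lr, warm_steps = 0.001, 80_000
for step in (1, 20_000, 80_000, 320_000):
    lr = base_lr * min(step ** -0.5, step * warm_steps ** -1.5)
    print(f"step={step:>7}: lr={lr:.3e}")
# step=      1: lr=4.419e-11
# step=  20000: lr=8.839e-07
# step=  80000: lr=3.536e-06  (peak at warm_steps)
# step= 320000: lr=1.768e-06
```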

Differential Revision: D87127589
