7 changes: 0 additions & 7 deletions models/modeling_llama_opt.py
@@ -1335,13 +1335,6 @@ def forward(
         # embed positions
         hidden_states = inputs_embeds

-        if self.gradient_checkpointing and self.training:
-            if use_cache:
-                logger.warning_once(
-                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                )
-                use_cache = False
-
         # decoder layers
         all_hidden_states = () if output_hidden_states else None
         all_self_attns = () if output_attentions else None
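
The deleted block is the stock Hugging Face guard that forces `use_cache=False` whenever gradient checkpointing is active during training; with it gone, a caller's `use_cache` setting is honored even in checkpointed training runs. A minimal sketch of the behavior being removed, assuming a standard `transformers` install (the checkpoint name is just an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; any hub causal LM exercises the same code path.
name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

model.gradient_checkpointing_enable()  # model.gradient_checkpointing -> True
model.train()

inputs = tok("hello world", return_tensors="pt")
# With the stock guard in place, this call logs the warning shown above and
# silently runs with use_cache=False; with the guard deleted, use_cache=True
# is passed through to the decoder layers unchanged.
outputs = model(**inputs, use_cache=True)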
1 change: 1 addition & 0 deletions run_clm.sh
@@ -18,6 +18,7 @@ export LCKV_FUSED_SWIGLU=1
 # - to pretrain a tinyllama, change the config to `TinyLlama/TinyLlama-1.1B-intermediate-step-955k-token-2T`
 # - to initialize the model with a pretrained model, add `--model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T`
 # - to use the minipile dataset, use `--dataset_name JeanKaddour/minipile`, with proper `--preprocessing_num_workers`
+# - to use gradient checkpointing, add `--gradient_checkpointing`
 # - to enable wandb, use `--report_to wandb`
 accelerate launch run_clm.py \
     --tokenizer_name TinyLlama/TinyLlama-1.1B-intermediate-step-955k-token-2T \
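
The new comment documents a flag the script can already consume: assuming `run_clm.py` follows the standard Hugging Face example-script pattern of parsing `TrainingArguments` with `HfArgumentParser`, `--gradient_checkpointing` maps straight onto the boolean field sketched below, so no Python changes are needed. A minimal sketch (the output path is hypothetical, for illustration only):

from transformers import TrainingArguments

# Passing `--gradient_checkpointing` on the command line sets this field
# when the script parses TrainingArguments via HfArgumentParser.
args = TrainingArguments(
    output_dir="outputs",          # hypothetical path, for illustration
    gradient_checkpointing=True,
)
assert args.gradient_checkpointing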