Multi gpu training error #87

@IceIce1ce

Description

Hi author, thanks for the great code. When I run the training script in a multi-GPU setting (2 processes), training fails on the first step with a device-mismatch error on rank 1; the full traceback is below. Could you take a look? Here is the command I used to launch training:

uv run accelerate launch --num_processes 2 scripts/train.py configs/ltx2_av_lora_low_vram.yaml

Training 0/2000 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Starting... 0:03:59 ETA: --:--
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/vsw/Desktop/LTX-2/packages/ltx-trainer/scripts/train.py", line 64, in <module>
[rank1]: app()
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/typer/main.py", line 336, in __call__
[rank1]: raise e
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/typer/main.py", line 319, in __call__
[rank1]: return get_command(self)(*args, **kwargs)
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/click/core.py", line 1485, in __call__
[rank1]: return self.main(*args, **kwargs)
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/typer/core.py", line 719, in main
[rank1]: return _main(
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/typer/core.py", line 189, in _main
[rank1]: rv = self.invoke(ctx)
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/click/core.py", line 1269, in invoke
[rank1]: return ctx.invoke(self.callback, **ctx.params)
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/click/core.py", line 824, in invoke
[rank1]: return callback(*args, **kwargs)
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/typer/main.py", line 706, in wrapper
[rank1]: return callback(**use_params)
[rank1]: File "/home/vsw/Desktop/LTX-2/packages/ltx-trainer/scripts/train.py", line 60, in main
[rank1]: trainer.train(disable_progress_bars=disable_progress_bars)
[rank1]: File "/home/vsw/Desktop/LTX-2/packages/ltx-trainer/src/ltx_trainer/trainer.py", line 169, in train
[rank1]: loss = self._training_step(batch)
[rank1]: File "/home/vsw/Desktop/LTX-2/packages/ltx-trainer/src/ltx_trainer/trainer.py", line 312, in _training_step
[rank1]: video_embeds, audio_embeds, attention_mask = self._text_encoder._run_connectors(
[rank1]: File "/home/vsw/Desktop/LTX-2/packages/ltx-core/src/ltx_core/text_encoders/gemma/encoders/av_encoder.py", line 53, in _run_connectors
[rank1]: encoded, encoded_connector_attention_mask = self.embeddings_connector(
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/vsw/Desktop/LTX-2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/vsw/Desktop/LTX-2/packages/ltx-core/src/ltx_core/text_encoders/gemma/embeddings_connector.py", line 173, in forward
[rank1]: hidden_states, attention_mask = self._replace_padded_with_learnable_registers(hidden_states, attention_mask)
[rank1]: File "/home/vsw/Desktop/LTX-2/packages/ltx-core/src/ltx_core/text_encoders/gemma/embeddings_connector.py", line 148, in _replace_padded_with_learnable_registers
[rank1]: hidden_states = flipped_mask * adjusted_hidden_states + (1 - flipped_mask) * learnable_registers
[rank1]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
Training 0/2000 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Starting... 0:04:16 ETA: --:--
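
For anyone hitting the same error: the last frame shows that `flipped_mask`, `adjusted_hidden_states` and `learnable_registers` are not all on the same GPU when rank 1 evaluates the blend expression. The traceback does not say which of the three is on `cuda:0`, so that needs to be checked first. Below is a minimal sketch of how one might inspect the devices and align the operands before the arithmetic. The helper names `describe_devices` and `blend_with_registers` are hypothetical and not part of ltx-core, and this is a possible workaround under those assumptions, not the maintainers' fix.

```python
import torch


def describe_devices(**tensors: torch.Tensor) -> dict[str, str]:
    """Return {name: device} so a mismatched operand is easy to spot."""
    return {name: str(t.device) for name, t in tensors.items()}


def blend_with_registers(
    hidden_states: torch.Tensor,
    flipped_mask: torch.Tensor,
    learnable_registers: torch.Tensor,
) -> torch.Tensor:
    """Hypothetical stand-in for the failing line in the traceback.

    Moves the mask and the registers onto the device (and dtype) of
    hidden_states, then applies the same blend expression.
    """
    target = {"device": hidden_states.device, "dtype": hidden_states.dtype}
    flipped_mask = flipped_mask.to(**target)
    learnable_registers = learnable_registers.to(**target)
    return flipped_mask * hidden_states + (1 - flipped_mask) * learnable_registers


if __name__ == "__main__":
    # Falls back to CPU so the sketch also runs on a machine without GPUs.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    hidden = torch.ones(1, 4, 2, device=device)               # (batch, tokens, dim)
    mask = torch.tensor([1.0, 1.0, 0.0, 0.0]).view(1, 4, 1)   # created on CPU
    registers = torch.zeros(4, 2)                             # created on CPU
    print(describe_devices(hidden=hidden, mask=mask, registers=registers))
    print(blend_with_registers(hidden, mask, registers))
```

Calling `describe_devices` on the three operands just before the failing line, on each rank, should show which one stays on `cuda:0`. If it turns out to be a module parameter or cached embedding, moving that module or tensor to the process's device once at setup is likely cleaner than calling `.to()` on every step.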
