Facing issues when trying to reproduce the same run with 4xH200s #27
Description
I followed the installation instructions in the repository's README.md
and attempted to run the SDPO generalization experiment. However, the
training process fails during model initialization.
Environment
- Python version: 3.12
- CUDA version: 12.8
- PyTorch version: 2.8.0
- GPU: 2 × NVIDIA H200
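For reference, these versions can be re-checked with a short snippet like the one below; this is just a sketch that assumes it runs inside the same venv (/venv/main) that the experiment script uses, and it includes a `flash_attn` check because that is the package failing in the trace:

```python
# Sketch for dumping the versions listed above; assumes the same venv
# (/venv/main) that the experiment script uses.
import torch

print("torch:", torch.__version__)
print("torch CUDA build:", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as e:
    # On this setup the import itself fails (see Error Logs below).
    print("flash_attn import failed:", e)
```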
Steps to Reproduce
- Follow the installation instructions from README.md.
- Run the experiment script: `bash experiments/generalization/run_sdpo_all.sh`
- The process fails during model initialization.
Error Logs
```
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(TaskRunner pid=37717) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_ref_init_model() (pid=38309, ip=172.17.0.3, actor_id=d6503e2665e2d40fc6795be001000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7114d32adb20>)
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=37717) return self.__get_result()
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=37717) raise self._exception
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/single_controller/ray/base.py", line 844, in func
(TaskRunner pid=37717) return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/single_controller/base/decorator.py", line 462, in inner
(TaskRunner pid=37717) return func(*args, **kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/utils/transferqueue_utils.py", line 314, in dummy_inner
(TaskRunner pid=37717) output = func(*args, **kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/workers/fsdp_workers.py", line 812, in init_model
(TaskRunner pid=37717) ) = self._build_model_optimizer(
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/workspace/verl/workers/fsdp_workers.py", line 400, in _build_model_optimizer
(TaskRunner pid=37717) actor_module = actor_module_class.from_pretrained(
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
(TaskRunner pid=37717) return model_class.from_pretrained(
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 288, in _wrapper
(TaskRunner pid=37717) return func(*args, **kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5103, in from_pretrained
(TaskRunner pid=37717) model = cls(config, *model_args, **model_kwargs)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/models/qwen3/modeling_qwen3.py", line 435, in __init__
(TaskRunner pid=37717) super().__init__(config)
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2197, in __init__
(TaskRunner pid=37717) self.config._attn_implementation_internal = self._check_and_adjust_attn_implementation(
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2812, in _check_and_adjust_attn_implementation
(TaskRunner pid=37717) lazy_import_flash_attention(applicable_attn_implementation)
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 136, in lazy_import_flash_attention
(TaskRunner pid=37717) _flash_fn, _flash_varlen_fn, _pad_fn, _unpad_fn = _lazy_imports(implementation)
(TaskRunner pid=37717) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 83, in _lazy_imports
(TaskRunner pid=37717) from flash_attn import flash_attn_func, flash_attn_varlen_func
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
(TaskRunner pid=37717) from flash_attn.flash_attn_interface import (
(TaskRunner pid=37717) File "/venv/main/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
(TaskRunner pid=37717) import flash_attn_2_cuda as flash_attn_gpu
(TaskRunner pid=37717) ImportError: /venv/main/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_jb
(WorkerDict pid=38309) /workspace/verl/utils/tokenizer.py:107: UserWarning: Failed to create processor: Unsupported processor type: Qwen2TokenizerFast. This may affect multimodal processing [repeated 3x across cluster]
(WorkerDict pid=38309) warnings.warn(f"Failed to create processor: {e}. This may affect multimodal processing", stacklevel=1) [repeated 3x across cluster]
(WorkerDict pid=38309) `torch_dtype` is deprecated! Use `dtype` instead! [repeated 3x across cluster]
```
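For what it's worth, the undefined symbol at the bottom of the trace (`_ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_jb`) demangles to a `c10::cuda` function from PyTorch itself, which usually indicates a flash-attn wheel compiled against a different torch version than the one installed. The failure should also reproduce outside of Ray/verl with a bare import; the snippet below is a minimal isolation check I put together, not part of the repo's scripts:

```python
# Minimal isolation of the ImportError above: import torch first (so its
# shared libraries are loaded), then pull in the same flash_attn symbols
# that transformers' modeling_flash_attention_utils.py imports.
import torch

from flash_attn import flash_attn_func, flash_attn_varlen_func  # line that fails in the trace

print("flash_attn imported cleanly against torch", torch.__version__)
```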
Any guidance would be appreciated. Thanks!