When I run the training code:
python -m torch.distributed.run --nproc_per_node=2 --master_port 12345 runner.py --model_name scratch-mixtral-800m-deep \
I encounter the following error:
[INFO|nuplan_raster_encoder.py:85] 2025-02-21 11:47:24,974 >> Building ViT encoder with key points indices of [79, 39, 19, 9, 4]
[INFO|str_base.py:105] 2025-02-21 11:47:24,978 >> Now using z-loss for MoE router balancing.
[INFO|str_base.py:1168] 2025-02-21 11:47:25,974 >> Scratch MixTralTrajectory Initialized!
[INFO|trainer.py:746] 2025-02-21 11:47:26,694 >> Using auto half precision backend
[INFO|trainer.py:2405] 2025-02-21 11:47:26,891 >> ***** Running training *****
0%| | 0/19403840 [00:00<?, ?it/s][WARNING|modeling_utils.py:1210] 2025-02-21 11:47:33,421 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
0%| | 15/19403840 [00:13<2575:12:02, 2.09it/s]Traceback (most recent call last):
.....
ValueError: Caught ValueError in DataLoader worker process 16.
Original Traceback (most recent call last):
File "/home/avt/anaconda3/envs/statetrm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/home/avt/anaconda3/envs/statetrm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
File "/media/avt/P7000Z_2TB/StateTransformer/transformer4planning/preprocess/nuplan_rasterize.py", line 52, in nuplan_rasterize_collate_func
rst = map_func(d)
File "/media/avt/P7000Z_2TB/StateTransformer/transformer4planning/preprocess/nuplan_rasterize.py", line 199, in static_coor_rasterize
check_distance=ego_pose_agent_dic[frame_id // frequency_change_rate-10:frame_id // frequency_change_rate+81:10]-ego_pose_agent_dic[frame_id // frequency_change_rate-20:frame_id // frequency_change_rate+71:10]
ValueError: operands could not be broadcast together with shapes (9,4) (10,4)
I am training with the provided dataset. By the way, I have disabled --use_speed because it caused an earlier error.
Do you have any idea why this is happening? Thank you in advance for the help @JohnZhan2023 @larksq
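For reference, here is a minimal sketch of how the same shape mismatch can arise. It is only a guess at the cause, not something confirmed from the repository: it assumes ego_pose_agent_dic behaves like a NumPy array of shape (N, 4) and that the index plus 80 runs past the end of the array for some samples, so the first slice is silently truncated to 9 rows while the second still has 10. The names window_diff and safe_window_diff, and the array lengths, are hypothetical and chosen only for illustration.

```python
import numpy as np

def window_diff(poses, idx):
    """Mimic the failing expression: subtract two strided windows of poses."""
    later = poses[idx - 10: idx + 81: 10]    # expects rows idx-10, idx, ..., idx+80
    earlier = poses[idx - 20: idx + 71: 10]  # expects rows idx-20, idx-10, ..., idx+70
    return later - earlier

def safe_window_diff(poses, idx):
    """Hypothetical guard: return None when either window would be truncated."""
    if idx - 20 < 0 or idx + 80 >= len(poses):
        return None
    return window_diff(poses, idx)

# Long enough array: both windows have 10 rows.
poses = np.zeros((200, 4))
ok = window_diff(poses, 100)

# Array that ends between idx+70 and idx+80: the first window loses a row,
# because NumPy clips out-of-range slice bounds instead of raising.
short = np.zeros((175, 4))
try:
    window_diff(short, 100)
except ValueError as e:
    error_message = str(e)

print(ok.shape)
print(error_message)
print(safe_window_diff(short, 100))
```

If this is what is happening, printing the length of ego_pose_agent_dic and the value of frame_id // frequency_change_rate just before line 199 should show which samples are too short.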