nuplan_rasterize error at beginning of training #197

@hz3014

When I run the training code:

python -m torch.distributed.run --nproc_per_node=2 --master_port 12345 runner.py --model_name scratch-mixtral-800m-deep \

I encounter the following error:

[INFO|nuplan_raster_encoder.py:85] 2025-02-21 11:47:24,974 >> Building ViT encoder with key points indices of [79, 39, 19, 9, 4]
[INFO|str_base.py:105] 2025-02-21 11:47:24,978 >> Now using z-loss for MoE router balancing.
[INFO|str_base.py:1168] 2025-02-21 11:47:25,974 >> Scratch MixTralTrajectory Initialized!
[INFO|trainer.py:746] 2025-02-21 11:47:26,694 >> Using auto half precision backend
[INFO|trainer.py:2405] 2025-02-21 11:47:26,891 >> ***** Running training *****

  0%|                                                        | 0/19403840 [00:00<?, ?it/s][WARNING|modeling_utils.py:1210] 2025-02-21 11:47:33,421 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
  0%|                                               | 15/19403840 [00:13<2575:12:02,  2.09it/s]Traceback (most recent call last):
.....
ValueError: Caught ValueError in DataLoader worker process 16.
Original Traceback (most recent call last):
  File "/home/avt/anaconda3/envs/statetrm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/avt/anaconda3/envs/statetrm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/media/avt/P7000Z_2TB/StateTransformer/transformer4planning/preprocess/nuplan_rasterize.py", line 52, in nuplan_rasterize_collate_func
    rst = map_func(d)
  File "/media/avt/P7000Z_2TB/StateTransformer/transformer4planning/preprocess/nuplan_rasterize.py", line 199, in static_coor_rasterize
    check_distance=ego_pose_agent_dic[frame_id // frequency_change_rate-10:frame_id // frequency_change_rate+81:10]-ego_pose_agent_dic[frame_id // frequency_change_rate-20:frame_id // frequency_change_rate+71:10]
ValueError: operands could not be broadcast together with shapes (9,4) (10,4) 
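For reference, the shape mismatch itself can be reproduced outside the training loop. The sketch below is only an illustration under assumptions: the array lengths, the index `f` (standing in for `frame_id // frequency_change_rate`), and the helper names `window_shapes` and `has_full_window` are hypothetical and not taken from the repository. It relies on the fact that NumPy slicing silently clips an out-of-range stop index, so a pose array that ends before `f + 80` yields 9 rows for the first slice while the second slice still yields 10. Whether this is the actual cause in the provided dataset is not confirmed here.

```python
import numpy as np


def window_shapes(poses, f):
    """Return the shapes of the two slices subtracted in the traceback.

    `f` plays the role of frame_id // frequency_change_rate (assumption).
    """
    later = poses[f - 10 : f + 81 : 10]    # rows f-10, f, ..., f+80
    earlier = poses[f - 20 : f + 71 : 10]  # rows f-20, f-10, ..., f+70
    return later.shape, earlier.shape


def has_full_window(num_frames, f):
    """Hypothetical guard: True only if both slices can return 10 rows."""
    return f - 20 >= 0 and f + 80 < num_frames


# Long enough sequence: both slices have 10 rows.
poses_full = np.zeros((200, 4))
shapes_ok = window_shapes(poses_full, 50)     # ((10, 4), (10, 4))

# Sequence that ends before f + 80: the first slice is clipped to 9 rows.
poses_short = np.zeros((125, 4))
shapes_bad = window_shapes(poses_short, 50)   # ((9, 4), (10, 4))

# Subtracting the two clipped slices raises the same kind of ValueError.
try:
    poses_short[40:131:10] - poses_short[30:121:10]
    raised = False
except ValueError:
    raised = True
```

A check like `has_full_window` could be used to skip or log samples whose pose array is too short, which would at least help identify which scenario triggers the error.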

I am training with the provided dataset. BTW, I have disabled --use_speed because it caused an earlier error.
Do you have any idea why this is happening? Thank you in advance for the help @JohnZhan2023 @larksq
