Out of memory error on V100 #22

@Bonnie0126

Description

First of all, I want to commend the team for this amazing work — it’s truly impressive!

I encountered an out-of-memory error when running the model on a single V100 GPU; the crash happens in `create_from_pcd` during scene initialisation (full log below). Could you help me understand what might be causing this? Are there any specific configurations or settings I should adjust to resolve it?
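In case it helps with debugging, this is the small helper I'd use to log allocator state around the initialisation steps. It is generic PyTorch, nothing repo-specific, and the `tag` argument is just my own label:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Print current and peak CUDA allocator usage in GiB."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")
```

Calling this right before `create_from_pcd` would show whether the human model has already consumed most of the 16 GB before the scene allocation is attempted.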

(/home/ubuntu/hugs) ubuntu@VM-0-8-ubuntu:~/ml-hugs-main$ python main.py --cfg_file cfg_files/release/neuman/hugs_human_scene.yaml dataset.seq=lab
Experiment hyperparams:
    dataset.seq, vals=['lab', 'citron', 'seattle', 'bike', 'jogging', 'parkinglot']
6it [00:00, 14648.33it/s]
2024-11-25 16:46:39.803 | INFO     | __main__:<module>:95 - Running 6 experiments
2024-11-25 16:46:39.804 | INFO     | __main__:<module>:105 - Running experiment 0 -- demo-dataset.seq=lab
2024-11-25 16:46:39.824 | INFO     | __main__:get_logger:53 - Logging to output/human_scene/neuman/lab/hugs_trimlp/demo-dataset.seq=lab/2024-11-25_16-46-39
2024-11-25 16:46:39.832 | INFO     | __main__:get_logger:54 - seed: 0
mode: human_scene
output_path: output
cfg_file: cfg_files/release/neuman/hugs_human_scene.yaml
exp_name: demo-dataset.seq=lab
dataset_path: ''
detect_anomaly: false
debug: false
wandb: false
logdir: output/human_scene/neuman/lab/hugs_trimlp/demo-dataset.seq=lab/2024-11-25_16-46-39
logdir_ckpt: output/human_scene/neuman/lab/hugs_trimlp/demo-dataset.seq=lab/2024-11-25_16-46-39/ckpt
eval: false
bg_color: white
dataset:
  name: neuman
  seq: lab
train:
  batch_size: 1
  num_workers: 0
  num_steps: 14998
  save_ckpt_interval: 15000
  val_interval: 1000
  anim_interval: 15000
  optim_scene: true
  save_progress_images: false
  progress_save_interval: 10
human:
  name: hugs_trimlp
  ckpt: null
  sh_degree: 0
  n_subdivision: 2
  only_rgb: false
  use_surface: false
  use_deformer: true
  init_2d: false
  disable_posedirs: true
  res_offset: false
  rotate_sh: false
  isotropic: false
  init_scale_multiplier: 0.5
  run_init: false
  estimate_delta: true
  triplane_res: 256
  optim_pose: true
  optim_betas: false
  optim_trans: true
  optim_eps_offsets: false
  activation: relu
  canon_nframes: 60
  canon_pose_type: da_pose
  knn_n_hops: 3
  lr:
    wd: 0.0
    position: 0.00016
    position_init: 0.00016
    position_final: 1.6e-06
    position_delay_mult: 0.01
    position_max_steps: 30000
    opacity: 0.05
    scaling: 0.005
    rotation: 0.001
    feature: 0.0025
    smpl_spatial: 2.0
    smpl_pose: 0.0001
    smpl_betas: 0.0001
    smpl_trans: 0.0001
    smpl_eps_offset: 0.0001
    lbs_weights: 0.0
    posedirs: 0.0
    percent_dense: 0.01
    appearance: 0.001
    geometry: 0.001
    vembed: 0.001
    deformation: 0.0001
    scale_lr_w_npoints: false
  loss:
    ssim_w: 0.2
    l1_w: 0.8
    lpips_w: 1.0
    lbs_w: 1000.0
    humansep_w: 1.0
    num_patches: 4
    patch_size: 128
    use_patches: 1
  densification_interval: 600
  opacity_reset_interval: 3000
  densify_from_iter: 3000
  densify_until_iter: 15000
  densify_grad_threshold: 0.0002
  prune_min_opacity: 0.005
  densify_extent: 1.0
  max_n_gaussians: 524288
  percent_dense: 0.01
scene:
  name: scene_gs
  ckpt: null
  sh_degree: 3
  add_bg_points: false
  num_bg_points: 204800
  bg_sphere_dist: 5.0
  clean_pcd: false
  opt_start_iter: -1
  lr:
    percent_dense: 0.01
    spatial_scale: 1.0
    position_init: 0.00016
    position_final: 1.6e-06
    position_delay_mult: 0.01
    position_max_steps: 30000
    opacity: 0.05
    scaling: 0.005
    rotation: 0.001
    feature: 0.0025
  percent_dense: 0.01
  densification_interval: 100
  opacity_reset_interval: 20000
  densify_from_iter: 500
  densify_until_iter: 15000
  densify_grad_threshold: 0.0002
  prune_min_opacity: 0.005
  max_n_gaussians: 2097152
  loss:
    ssim_w: 0.2
    l1_w: 0.8

2024-11-25 16:46:39.840 | INFO     | hugs.trainer.gs_trainer:get_train_dataset:39 - Loading NeuMan dataset lab-train
reading cameras: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25731.93it/s]
reading images meta: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:00<00:00, 7704.89it/s]
reading point cloud: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9013/9013 [00:00<00:00, 160408.97it/s]
Computing near/far for ['bkg']: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:00<00:00, 726.33it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:04<00:00, 16.83it/s]
2024-11-25 16:46:45.067 | INFO     | hugs.trainer.gs_trainer:get_val_dataset:54 - Loading NeuMan dataset lab-val
reading cameras: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22192.08it/s]
reading images meta: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:00<00:00, 7637.20it/s]
reading point cloud: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9013/9013 [00:00<00:00, 160942.33it/s]
Computing near/far for ['bkg']: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:00<00:00, 735.09it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 17.17it/s]
2024-11-25 16:46:45.988 | INFO     | hugs.trainer.gs_trainer:get_anim_dataset:62 - Loading NeuMan dataset lab-anim
reading cameras: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22550.02it/s]
reading images meta: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:00<00:00, 7749.12it/s]
reading point cloud: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9013/9013 [00:00<00:00, 161468.22it/s]
Computing near/far for ['bkg']: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:00<00:00, 719.20it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [00:00<00:00, 1228.85it/s]
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/ubuntu/hugs/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/ubuntu/hugs/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /home/ubuntu/hugs/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
/home/ubuntu/hugs/lib/python3.8/site-packages/lpips/lpips.py:107: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False)
2024-11-25 16:46:47.577 | INFO     | hugs.models.hugs_trimlp:create_betas:139 - Created betas with shape: torch.Size([10]), requires_grad: False
/home/ubuntu/hugs/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:134: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
2024-11-25 16:46:47.626 | INFO     | hugs.models.hugs_trimlp:__init__:109 - Subdividing SMPL model 2 times
# vertices before subdivision: (6890, 3)
# vertices before subdivision: (27554, 3)
/home/ubuntu/ml-hugs-main/hugs/models/hugs_trimlp.py:120: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
  self.edges = torch.from_numpy(edges).to(self.device).long()
2024-11-25 16:46:59.590 | INFO     | hugs.models.hugs_trimlp:create_betas:139 - Created betas with shape: torch.Size([10]), requires_grad: False
2024-11-25 16:47:39.465 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: xyz, lr: 0.00032
2024-11-25 16:47:39.466 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: v_embed, lr: 0.001
2024-11-25 16:47:39.466 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: geometry_dec, lr: 0.001
2024-11-25 16:47:39.466 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: appearance_dec, lr: 0.001
2024-11-25 16:47:39.466 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: deform_dec, lr: 0.0005
/home/ubuntu/hugs/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:60: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn(
===== Ground truth values: =====
xyz_offsets torch.Size([110210, 3])
scales torch.Size([110210, 3])
rot6d_canon torch.Size([110210, 6])
shs torch.Size([110210, 16, 3])
opacity torch.Size([110210, 1])
lbs_weights torch.Size([110210, 24])
posedirs torch.Size([207, 330630])
deformed_normals torch.Size([110210, 3])
faces torch.Size([13776, 3])
edges torch.Size([330624, 2])
================================
2024-11-25 16:51:30.650 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: xyz, lr: 0.00032
2024-11-25 16:51:30.651 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: v_embed, lr: 0.001
2024-11-25 16:51:30.651 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: geometry_dec, lr: 0.001
2024-11-25 16:51:30.651 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: appearance_dec, lr: 0.001
2024-11-25 16:51:30.651 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: deform_dec, lr: 0.0001
2024-11-25 16:51:30.652 | INFO     | hugs.trainer.gs_trainer:__init__:128 - HUGS TRIMLP:
xyz: torch.Size([110210, 3])
max_radii2D: torch.Size([110210])
xyz_gradient_accum: torch.Size([110210, 1])
denom: torch.Size([110210, 1])

2024-11-25 16:51:30.653 | INFO     | hugs.models.hugs_trimlp:create_betas:139 - Created betas with shape: torch.Size([10]), requires_grad: False
2024-11-25 16:51:30.683 | INFO     | hugs.models.hugs_trimlp:create_body_pose:130 - Created body pose with shape: torch.Size([82, 138]), requires_grad: True
2024-11-25 16:51:30.684 | INFO     | hugs.models.hugs_trimlp:create_global_orient:135 - Created global_orient with shape: torch.Size([82, 6]), requires_grad: True
2024-11-25 16:51:30.684 | INFO     | hugs.models.hugs_trimlp:create_transl:143 - Created transl with shape: torch.Size([82, 3]), requires_grad: True
2024-11-25 16:51:30.685 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: xyz, lr: 0.00032
2024-11-25 16:51:30.685 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: v_embed, lr: 0.001
2024-11-25 16:51:30.685 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: geometry_dec, lr: 0.001
2024-11-25 16:51:30.685 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: appearance_dec, lr: 0.001
2024-11-25 16:51:30.685 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: deform_dec, lr: 0.0001
2024-11-25 16:51:30.685 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: global_orient, lr: 0.0001
2024-11-25 16:51:30.686 | INFO     | hugs.models.hugs_trimlp:setup_optimizer:699 - Parameter: body_pose, lr: 0.0001
2024-11-25 16:51:30.686 | INFO     | hugs.trainer.gs_trainer:__init__:157 - SceneGS:
xyz: torch.Size([0])
features_dc: torch.Size([0])
features_rest: torch.Size([0])
scaling: torch.Size([0])
rotation: torch.Size([0])
opacity: torch.Size([0])
max_radii2D: torch.Size([0])
xyz_gradient_accum: torch.Size([0])
denom: torch.Size([0])

2024-11-25 16:51:30.687 | INFO     | hugs.models.scene:create_from_pcd:179 - Number of scene points at initialisation: 9013
Traceback (most recent call last):
  File "main.py", line 108, in <module>
    main(cfg)
  File "main.py", line 66, in main
    trainer = GaussianTrainer(cfg)
  File "/home/ubuntu/ml-hugs-main/hugs/trainer/gs_trainer.py", line 171, in __init__
    self.scene_gs.create_from_pcd(pcd, spatial_lr_scale)
  File "/home/ubuntu/ml-hugs-main/hugs/models/scene.py", line 181, in create_from_pcd
    dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
MemoryError: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
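The failing `distCUDA2` call is where I thought a workaround might go. Assuming it returns the mean squared distance from each point to its three nearest neighbours (which is how I understand the 3DGS-style codebases use it), a chunked pure-PyTorch fallback that bounds peak memory at O(chunk × N) instead of O(N × N) might look like this sketch. `knn_mean_sq_dist` is my own name, not a function from this repo:

```python
import torch

def knn_mean_sq_dist(points: torch.Tensor, k: int = 3,
                     chunk: int = 2048) -> torch.Tensor:
    """Mean squared distance from each point to its k nearest
    neighbours (excluding itself), computed chunk-by-chunk so the
    full N x N distance matrix is never materialised at once."""
    n = points.shape[0]
    out = torch.empty(n, dtype=points.dtype, device=points.device)
    for start in range(0, n, chunk):
        q = points[start:start + chunk]            # (c, 3) query slice
        d2 = torch.cdist(q, points).pow(2)         # (c, N) squared distances
        # Take the k+1 smallest: the zero self-distance plus k neighbours,
        # then drop the self column before averaging.
        knn = d2.topk(k + 1, largest=False).values[:, 1:]
        out[start:start + chunk] = knn.mean(dim=1)
    return out
```

With only 9013 scene points this would be cheap even on CPU, which also makes me suspect the allocation that fails is small and the card is simply already full by this point.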

Thank you!
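For reference, these are the memory-heavy settings from the config above that I was planning to try lowering, assuming the launcher accepts the same dotted-key override syntax it already accepts for `dataset.seq=lab` (I have not verified that these particular keys are overridable from the command line):

```shell
# Hypothetical overrides -- same dotted-key syntax as dataset.seq=lab,
# halving subdivision depth, triplane resolution, and the Gaussian cap.
python main.py --cfg_file cfg_files/release/neuman/hugs_human_scene.yaml \
    dataset.seq=lab \
    human.n_subdivision=1 \
    human.triplane_res=128 \
    scene.max_n_gaussians=1048576
```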
