-
2D->3D, Ground-Truth Data
The paper seems to state that the "raw videos" do not contain any labeling information.
I would like to ask: are the raw videos monocular 2D videos captured by a single camera? If so, do you use a depth estimation algorithm to infer 3D information from them? -
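For context on what "inferring 3D from monocular 2D video" usually involves: once a per-pixel depth map is estimated, each pixel can be lifted to a 3D point with the standard pinhole camera model. This is a minimal sketch of that back-projection step only (the depth values and intrinsics here are toy placeholders, not from the paper):

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (H x W, metres) to an (H*W, 3) point cloud using the
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# toy example: a flat surface 2 m away, principal point at the image centre
depth = np.full((4, 4), 2.0)
pts = backproject_depth(depth, fx=100.0, fy=100.0, cx=1.5, cy=1.5)
```

Whether the authors use a learned monocular depth estimator to produce `depth`, or recover geometry some other way, is exactly the open question above.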
"Brain" doesn't participate in autoregressive inference

```python
output_hs, inputs_masks = self.prepare_vlm_features(
    pixel_value,
    input_ids,
    attention_mask,
    current_state_mask,
    current_state,
    fov,
    use_cache=use_cache,
)
# handle multiple samples for one input
samples, _ = self._forward_act_model(
    vlm_features=output_hs,
    attention_mask=inputs_masks,
    action_masks=x_mask,
    current_state=current_state,
    current_state_mask=current_state_mask,
    mode="eval",
    repeated_diffusion_steps=sample_times,
    cfg_scale=cfg_scale,
    use_ddim=use_ddim,
    num_ddim_steps=num_ddim_steps,
)
action_np = samples.cpu().numpy() * x_mask.cpu().numpy()  # sample_times x T x D
return action_np
```
- Open-loop robotic control, with no feedback: the VLM features are computed once and a whole action chunk is sampled in one pass, so if the robot breaks something mid-chunk, the model never knows.
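A common mitigation for this open-loop failure mode is receding-horizon execution: predict a chunk of actions, execute only the first few, then re-query the model with a fresh observation. This is a minimal sketch of that loop; `policy`, `get_obs`, and `execute` are hypothetical stand-ins, not functions from this repo:

```python
import numpy as np

def receding_horizon_control(policy, get_obs, execute,
                             replan_every=4, steps=12):
    """Execute only the first `replan_every` actions of each predicted chunk,
    then re-plan from a new observation, so failures can be reacted to."""
    executed = []
    t = 0
    while t < steps:
        chunk = policy(get_obs())  # (horizon, action_dim) action chunk
        for a in chunk[:replan_every]:
            execute(a)
            executed.append(a)
            t += 1
            if t >= steps:
                break
    return np.asarray(executed)

# toy policy/environment standing in for the real model and robot
obs_counter = {"n": 0}
def get_obs():
    obs_counter["n"] += 1
    return np.array([float(obs_counter["n"])])

dummy_policy = lambda obs: np.tile(obs, (16, 1))  # 16-step chunk repeating obs
acts = receding_horizon_control(dummy_policy, get_obs, execute=lambda a: None)
```

Whether this repo's inference API supports such re-planning cheaply (e.g. via the `use_cache` argument above) is part of the question being raised.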