
Questions on Video Representation and Inference in VITRA #22

@UEFI-code

Description
  1. 2D->3D, Ground-Truth Data
    The paper seems to state that the "raw videos" contain no labeling information.
    Are the raw videos monocular 2D videos captured by a single camera? If so, do you use a depth-estimation algorithm to recover 3D information from them?
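    For context on what "2D->3D" would involve: the paper's actual pipeline is not shown here, but if a monocular depth estimator produces a per-pixel depth map, lifting it to 3D camera coordinates only requires the pinhole intrinsics. A minimal sketch (the function name and intrinsics are illustrative, not from VITRA):

```python
import numpy as np

def backproject_depth(depth: np.ndarray, fx: float, fy: float,
                      cx: float, cy: float) -> np.ndarray:
    """Lift a depth map (H x W) to 3D points (H x W x 3) in the
    camera frame, assuming a pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# A pixel at the principal point maps to (0, 0, depth).
pts = backproject_depth(np.full((4, 4), 2.0), fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(pts[2, 2])  # -> [0. 0. 2.]
```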

  2. "Brain" doesn't participate in autoregressive inference
    In the inference code below, the VLM features are prepared once and then passed to the action model; does the VLM ("brain") take part in the iterative sampling itself?

        output_hs, inputs_masks = self.prepare_vlm_features(
            pixel_value,
            input_ids,
            attention_mask,
            current_state_mask,
            current_state,
            fov,
            use_cache=use_cache,
        )
        # handle multiple samples for one input
        samples, _ = self._forward_act_model(
            vlm_features=output_hs,
            attention_mask=inputs_masks,
            action_masks=x_mask,
            current_state=current_state,
            current_state_mask=current_state_mask,
            mode="eval",
            repeated_diffusion_steps=sample_times,
            cfg_scale=cfg_scale,
            use_ddim=use_ddim,
            num_ddim_steps=num_ddim_steps,
        )
        action_np = samples.cpu().numpy() * x_mask.cpu().numpy()  # sample_times x T x D
        return action_np
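    The structure the snippet suggests can be sketched with toy stand-ins: the VLM encoder runs exactly once per observation, and only the (smaller) action head iterates during diffusion sampling. `encode_vlm` and `denoise_step` below are hypothetical numpy placeholders, not VITRA's modules:

```python
import numpy as np

rng = np.random.default_rng(0)
proj = rng.standard_normal((16, 8))  # fixed toy projection

def encode_vlm(obs: np.ndarray) -> np.ndarray:
    """Toy stand-in for prepare_vlm_features: called once per observation."""
    return np.tanh(obs @ proj)

def denoise_step(x: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Toy denoising update conditioned on the cached VLM features."""
    return x - 0.1 * (x - cond.mean()) / (t + 1)

obs = rng.standard_normal(16)
cond = encode_vlm(obs)          # the "brain" runs exactly once
x = rng.standard_normal(8)      # noisy action chunk
for t in range(10):             # only the action head loops
    x = denoise_step(x, cond, t)
```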
  3. Open-Loop Robotic Control, with NO Feedback
    If the robot breaks something during execution, the model never knows.
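    One common mitigation for open-loop failures (a general technique, not a claim about VITRA) is receding-horizon execution: predict a chunk of actions, execute only the first few, then re-observe and replan, so a failure shows up in the next observation. A minimal sketch with hypothetical `predict_chunk` and `env_step`:

```python
import numpy as np

def predict_chunk(obs: float, horizon: int = 8) -> np.ndarray:
    """Hypothetical policy: propose actions that move the state toward 0."""
    return np.full(horizon, -0.2 * obs)

def env_step(state: float, action: float) -> float:
    """Hypothetical environment: state simply integrates actions."""
    return state + action

state = 10.0
for _ in range(20):                 # closed loop: replan each iteration
    chunk = predict_chunk(state)
    for a in chunk[:2]:             # execute only the first k=2 actions
        state = env_step(state, a)  # new state is observed at the next replan
print(round(state, 3))
```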
