
Questions on Video Representation and Inference in VITRA #22

@UEFI-code

Description
  1. 2D->3D, Ground-Truth Data
    The paper seems to state that the "raw videos" contain no labeling information.
    Are the raw videos monocular 2D videos captured by a single camera? If so, do you use a depth-estimation algorithm to recover 3D information from them?
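    For context on what "2D->3D" would involve: the paper's actual pipeline is not shown here, but if a monocular depth estimator produces a per-pixel depth map, lifting it to 3D camera coordinates only requires the pinhole intrinsics. A minimal sketch (the function name and intrinsics are illustrative, not from VITRA):

```python
import numpy as np

def backproject_depth(depth: np.ndarray, fx: float, fy: float,
                      cx: float, cy: float) -> np.ndarray:
    """Lift a depth map (H x W) to 3D points (H x W x 3) in the
    camera frame, assuming a pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# A pixel at the principal point maps to (0, 0, depth).
pts = backproject_depth(np.full((4, 4), 2.0), fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(pts[2, 2])  # -> [0. 0. 2.]
```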

  2. "Brain" doesn't participate in autoregressive inference
    In the inference code below, the VLM features are prepared once and then passed to the action model; does the VLM ("brain") take part in the iterative sampling itself?

        output_hs, inputs_masks = self.prepare_vlm_features(
            pixel_value,
            input_ids,
            attention_mask,
            current_state_mask,
            current_state,
            fov,
            use_cache=use_cache,
        )
        # handle multiple samples for one input
        samples, _ = self._forward_act_model(
            vlm_features=output_hs,
            attention_mask=inputs_masks,
            action_masks=x_mask,
            current_state=current_state,
            current_state_mask=current_state_mask,
            mode="eval",
            repeated_diffusion_steps=sample_times,
            cfg_scale=cfg_scale,
            use_ddim=use_ddim,
            num_ddim_steps=num_ddim_steps,
        )
        action_np = samples.cpu().numpy() * x_mask.cpu().numpy()  # sample_times x T x D
        return action_np
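    The structure the snippet suggests can be sketched with toy stand-ins: the VLM encoder runs exactly once per observation, and only the (smaller) action head iterates during diffusion sampling. `encode_vlm` and `denoise_step` below are hypothetical numpy placeholders, not VITRA's modules:

```python
import numpy as np

rng = np.random.default_rng(0)
proj = rng.standard_normal((16, 8))  # fixed toy projection

def encode_vlm(obs: np.ndarray) -> np.ndarray:
    """Toy stand-in for prepare_vlm_features: called once per observation."""
    return np.tanh(obs @ proj)

def denoise_step(x: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Toy denoising update conditioned on the cached VLM features."""
    return x - 0.1 * (x - cond.mean()) / (t + 1)

obs = rng.standard_normal(16)
cond = encode_vlm(obs)          # the "brain" runs exactly once
x = rng.standard_normal(8)      # noisy action chunk
for t in range(10):             # only the action head loops
    x = denoise_step(x, cond, t)
```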
  3. Open-Loop Robotic Control, with NO Feedback
    If the robot breaks something during execution, the model never knows.
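    One common mitigation for open-loop failures (a general technique, not a claim about VITRA) is receding-horizon execution: predict a chunk of actions, execute only the first few, then re-observe and replan, so a failure shows up in the next observation. A minimal sketch with hypothetical `predict_chunk` and `env_step`:

```python
import numpy as np

def predict_chunk(obs: float, horizon: int = 8) -> np.ndarray:
    """Hypothetical policy: propose actions that move the state toward 0."""
    return np.full(horizon, -0.2 * obs)

def env_step(state: float, action: float) -> float:
    """Hypothetical environment: state simply integrates actions."""
    return state + action

state = 10.0
for _ in range(20):                 # closed loop: replan each iteration
    chunk = predict_chunk(state)
    for a in chunk[:2]:             # execute only the first k=2 actions
        state = env_step(state, a)  # new state is observed at the next replan
print(round(state, 3))
```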
