Conversation

@tiffanycai6

Description

Add Cosmos Transfer2.5 Multiview posttraining recipe

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

Changes Made

  • Added/updated documentation
  • Added/updated examples
  • Fixed bugs or issues
  • Improved code quality
  • Updated dependencies

Testing

  • I have tested the changes locally
  • Documentation builds successfully
  • Pre-commit hooks pass
  • Examples run without errors
  • Links and references are valid

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

Any additional information that reviewers should know.

Collaborator

@jingyijin2 left a comment

Thank you for the contribution, @tiffanycai6!
Looking good. A few issues need to be addressed. :)

Collaborator

@jingyijin2 left a comment

This is a masterpiece! It will be such a useful source of information for users. Thank you @tiffanycai6!

@@ -0,0 +1,464 @@
# Transfer2 Multiview Generation with World Scenario Map Control with Cosmos Predict2 Multiview
Collaborator

@tiffanycai6 just realized that this is for Transfer 2.5, so we are naming the directory as transfer2_5, and model name being referenced as "Cosmos Transfer 2.5" (format standardized by marketing). Could you make those changes?

- Uniform time weighting applied during training to handle AV data quality variations
- Shorter temporal window (8 latent frames vs 24 in other models) makes the multiview task more tractable

The key difference from single-view Cosmos models is the use of a specialized `ExtractFramesAndCaptions` augmentor that adds multiview-specific keys to the data loader:
Collaborator

For this part, is the data processing code available in the CT2.5 code base inside the i4 package? If yes, could we add the paths here? If not, we may need to carve those scripts out into the cookbook/scripts directory.
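For readers following along, a minimal sketch of what such a multiview augmentor might do. The key names ("frames", "captions", "camera_keys", "view_indices") and the camera list are illustrative assumptions, not the actual `ExtractFramesAndCaptions` implementation:

```python
from typing import Any

# Assumed 7-view rig; the real camera names come from the dataset config.
CAMERA_NAMES = [
    "front_wide", "front_left", "front_right",
    "cross_left", "cross_right", "rear_left", "rear_right",
]

def extract_frames_and_captions(sample: dict[str, Any]) -> dict[str, Any]:
    """Add multiview-specific keys so the data loader can batch per-view tensors."""
    views = sample["views"]  # assumed shape: {camera_name: {"video": ..., "caption": str}}
    sample["camera_keys"] = CAMERA_NAMES
    sample["view_indices"] = list(range(len(CAMERA_NAMES)))
    sample["frames"] = [views[c]["video"] for c in CAMERA_NAMES]
    sample["captions"] = [views[c]["caption"] for c in CAMERA_NAMES]
    return sample
```

The point is only that every sample carries a fixed, ordered set of views so downstream collation stays consistent across the batch.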


Cosmos Transfer 2.5 Multiview extends the base Predict 2.5 Multiview model through a ControlNet architecture, which adds spatial conditioning capabilities while preserving the model's temporal and multi-view consistency.

The architecture consists of two main components:
Collaborator

NIT [optional] It would be nice to include a diagram if that is available. OK if it is already published in the paper. :)
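As a toy illustration of the ControlNet conditioning described above (all shapes, weights, and the `block` function here are invented; this is not the Cosmos code): the control branch starts as a copy of the base weights, and its output is injected through zero-initialized projections, so at initialization the combined model reproduces the frozen base exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimension

def block(x, w):
    # Stand-in for a transformer block in the base model.
    return np.tanh(x @ w)

base_w = rng.standard_normal((D, D))   # frozen base weights
ctrl_w = base_w.copy()                 # control branch: trainable copy of the base
zero_proj = np.zeros((D, D))           # zero-init projection: no effect at init

def forward(x, control):
    h_base = block(x, base_w)
    h_ctrl = block(x + control, ctrl_w)   # spatial control enters the copy
    return h_base + h_ctrl @ zero_proj    # identical to the base at initialization

x = rng.standard_normal((2, D))
c = rng.standard_normal((2, D))
assert np.allclose(forward(x, c), block(x, base_w))
```

This zero-init property is what lets the ControlNet branch add spatial conditioning without disturbing the pretrained model's temporal and multi-view behavior at the start of post-training.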


| **Model** | **Workload** | **Use Case** |
|-----------|--------------|--------------|
| Cosmos Transfer2 | Post-training | Spatially-conditioned multiview AV video generation with world scenario map control |
Contributor

@tiffanycai6 I think it's Transfer2.5 here.


Software Dependencies:

- Cosmos framework
Contributor

@tiffanycai6, could you please hyperlink Setup.md here?


Pre-trained Models:

- Cosmos Predict 2.5 Multiview (2B parameters)
Contributor

It would be nice to link the download section of setup.md here.

Comment on lines +33 to +34
- Supports 7-view multi-camera configuration
- 29-frame context with 8 latent frame state
Contributor

These are under the Pre-trained Models heading; I don't think they belong in the pre-trained models field.

Comment on lines +55 to +57
Lane detection accuracy on generated videos showed significant deviation from ground truth layouts
3D cuboid evaluation revealed geometric inconsistencies in object placement across viewpoints
These limitations confirmed the need for explicit spatial conditioning through visual control inputs rather than relying solely on text descriptions.
Contributor

I think we were using FVD/FID and CSE/TSE for multiview eval; maybe you could point to them. They are located at https://github.com/nvidia-cosmos/cosmos-cookbook/tree/main/scripts/metrics

Comment on lines +70 to +75
World scenario maps project HD maps and dynamic objects in the scene onto the seven camera views as the control input (see the image below). The world scenario map includes:

- **Map elements**: Lane lines (with fine-grained types such as dashed line, dotted line, double yellow line), poles, road boundaries, and traffic lights with state, all rendered directly with appropriate colors and geometry patterns
- **Dynamic 3D bounding boxes**: Indicate positions of vehicles and pedestrians with occlusion-aware and heading-aware representations
- **Color coding**: Each object type is color-coded, with bounding boxes shaded according to the direction of motion, providing both semantic and motion cues
- **Complex road topologies**: Can represent overpasses and intricate road structures
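The renderings described above ultimately rest on projecting 3D map geometry and box corners into each camera view. A minimal pinhole-projection sketch (the intrinsics and points are made-up numbers, not the actual rendering code):

```python
import numpy as np

def project_points(K, pts_cam):
    """Pinhole-project camera-frame 3D points (N, 3) to pixel coordinates (N, 2)."""
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Toy intrinsics: focal length 500 px, principal point (640, 360).
K = np.array([[500.0, 0.0, 640.0],
              [0.0, 500.0, 360.0],
              [0.0, 0.0, 1.0]])

# A point straight ahead of the camera lands on the principal point.
center = project_points(K, np.array([[0.0, 0.0, 10.0]]))
assert np.allclose(center, [[640.0, 360.0]])
```

In the real pipeline the same projection is repeated per camera (each with its own extrinsics and intrinsics), which is what keeps the control input geometrically consistent across the seven views.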
Contributor

It would be nice to add an illustration of an HD map here.

Comment on lines +184 to +191
You must provide a folder containing a collection of videos in **MP4 format**, preferably 720p, as well as a corresponding folder containing the HD map control input videos in **MP4 format**. The views for each sample should be organized into subdirectories named after the camera. We have an example dataset at `assets/multiview_hdmap_posttrain_dataset`

#### 1.2 Verify the dataset folder format

Dataset folder format:

```
assets/multiview_hdmap_posttrain_dataset/
```
Contributor

Could you please add some details on how the user can get data with these fields of view?
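A quick sanity check of the expected layout might help users here. The directory names below (`videos/`, `control_input_hdmap/`, per-camera subdirectories) are assumptions for illustration, not the actual dataset schema:

```python
from pathlib import Path

def verify_dataset(root: str) -> list[str]:
    """Check that every view video has a matching HD map control input video.

    Assumed layout: <root>/videos/<camera>/<clip>.mp4 mirrored by
    <root>/control_input_hdmap/<camera>/<clip>.mp4.
    """
    base = Path(root)
    videos, hdmaps = base / "videos", base / "control_input_hdmap"
    problems = []
    for cam_dir in sorted(videos.iterdir()):
        for clip in sorted(cam_dir.glob("*.mp4")):
            twin = hdmaps / cam_dir.name / clip.name
            if not twin.exists():
                problems.append(f"missing control input for {clip}")
    return problems
```

Running this before training surfaces mismatched or missing control videos early instead of as a mid-epoch loader failure.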

Comment on lines +242 to +280
```python
transfer2_auto_multiview_post_train_example = dict(
    defaults=[
        f"/experiment/{DEFAULT_CHECKPOINT.experiment}",
        {"override /data_train": "example_multiview_train_data_control_input_hdmap"},
    ],
    job=dict(project="cosmos_transfer_v2p5", group="auto_multiview", name="2b_cosmos_multiview_post_train_example"),
    checkpoint=dict(
        save_iter=200,
        # pyrefly: ignore # missing-attribute
        load_path=get_checkpoint_path(DEFAULT_CHECKPOINT.s3.uri),
        load_training_state=False,
        strict_resume=False,
        load_from_object_store=dict(
            enabled=False,  # Loading from local filesystem, not S3
        ),
        save_to_object_store=dict(
            enabled=False,
        ),
    ),
    model=dict(
        config=dict(
            base_load_from=None,
        ),
    ),
    trainer=dict(
        logging_iter=100,
        max_iter=5_000,
        callbacks=dict(
            heart_beat=dict(
                save_s3=False,
            ),
            iter_speed=dict(
                hit_thres=200,
                save_s3=False,
            ),
            device_monitor=dict(
                save_s3=False,
            ),
            every_n_sample_reg=dict(
```
Contributor

Based on the VDR feedback for post-training, we would need to add the dataloader config too.

```json
{
  "name": "auto_multiview",
  "prompt_path": "prompt.txt",
```
Contributor

It would be nice to show what's in prompt.txt.

Comment on lines +400 to +405
We measure perceptual quality and temporal consistency using:

- **FVD (Fréchet Video Distance)**: Evaluates visual quality and temporal coherence
- **FID (Fréchet Inception Distance)**: Measures frame-level visual quality
- **Temporal Sampson Error**: Assesses temporal consistency within each camera view
- **Cross-Camera Sampson Error**: Measures geometric consistency across multiple camera views
Contributor

These metrics are located at https://github.com/nvidia-cosmos/cosmos-cookbook/tree/main/scripts/metrics; it would be nice to add a hyperlink.
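For reference, the per-correspondence Sampson error that temporal and cross-camera consistency metrics build on can be sketched as follows (a generic textbook implementation, not the cookbook's actual code):

```python
import numpy as np

def sampson_error(F, x1, x2):
    """First-order geometric error of correspondences w.r.t. fundamental matrix F.

    x1, x2: (N, 3) homogeneous pixel coordinates in the two views.
    """
    Fx1 = x1 @ F.T            # epipolar lines in image 2
    Ftx2 = x2 @ F             # epipolar lines in image 1
    num = np.einsum("ij,ij->i", x2, Fx1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den

# Sanity check: pure x-translation gives F = [t]_x, and points shifted only
# along x satisfy the epipolar constraint exactly.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x1 = np.array([[0.3, 0.5, 1.0], [-0.2, 0.1, 1.0]])
x2 = x1 + np.array([0.05, 0.0, 0.0])  # disparity along the epipolar line
assert sampson_error(F, x1, x2).max() < 1e-12
```

Aggregating this error over matched points between consecutive frames (temporal) or between camera pairs (cross-camera) yields the TSE/CSE-style scores discussed above.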

Comment on lines +421 to +435
<img width="675" height="166" alt="Screenshot 2025-12-05 at 4 46 04 PM" src="https://github.com/user-attachments/assets/f3f8d209-dd1a-49c0-a92f-0260f49eed34" />

We observe a significant boost (up to 2.3x improvement) in FVD/FID scores while remaining competitive in temporal and cross-camera Sampson error. This indicates that the model generates higher-quality, more realistic videos without sacrificing temporal consistency or multi-view geometric coherence.

**Spatial Accuracy Improvements**:

<img width="789" height="145" alt="Screenshot 2025-12-05 at 4 47 15 PM" src="https://github.com/user-attachments/assets/ea4829cf-61e1-4cdf-aeb9-0446564783aa" />

We observe a substantial improvement (up to 60%) in detection metrics compared to Transfer1-7B-Sample-AV. The enhanced performance on both lane detection and 3D cuboid detection tasks demonstrates that the model generates videos with significantly better adherence to the spatial control signals provided by world scenario maps.

#### Qualitative Analysis

Visual comparisons between Cosmos-Transfer1-7B-Sample-AV and Cosmos-Transfer2.5-2B/auto/multiview reveal significant quality improvements.

<img width="828" height="634" alt="Screenshot 2025-12-05 at 4 48 24 PM" src="https://github.com/user-attachments/assets/c0332162-a38a-4bd0-bebf-5ab18ea25166" />
Contributor

Please verify the asset paths.
