Add transfer2mv posttraining recipe #77
base: main
Conversation
jingyijin2 left a comment
Thank you for the contribution, @tiffanycai6!
Looking good. A few issues still need to be addressed. :)
docs/recipes/post_training/transfer2/av_world_scenario_maps/post_training.md
jingyijin2 left a comment
This is a masterpiece! It will be such a useful source of information for users. Thank you @tiffanycai6!
> @@ -0,0 +1,464 @@
> # Transfer2 Multiview Generation with World Scenario Map Control with Cosmos Predict2 Multiview
@tiffanycai6 just realized that this is for Transfer 2.5, so we are naming the directory transfer2_5 and referencing the model as "Cosmos Transfer 2.5" (format standardized by marketing). Could you make those changes?
> - Uniform time weighting applied during training to handle AV data quality variations
> - Shorter temporal window (8 latent frames vs 24 in other models) makes multiview task more tractable
>
> The key difference from single-view Cosmos models is the use of a specialized `ExtractFramesAndCaptions` augmentor that adds multiview-specific keys to the data loader:
For this part, is the data processing code available in the CT2.5 code base inside the i4 package? If yes, could we add the paths here? If not, we may need to carve those scripts out into the cookbook/scripts directory.
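As a rough illustration only, a multiview augmentor of the kind the excerpt mentions might look like the sketch below. All key names (`frames`, `captions`, `camera_names`, `n_views`) and the function shape are assumptions for illustration, not the actual `ExtractFramesAndCaptions` implementation.

```python
# Hypothetical sketch of a multiview augmentor in the style of
# ExtractFramesAndCaptions; key names are assumptions, not Cosmos code.
def extract_frames_and_captions(sample: dict, camera_names: list[str]) -> dict:
    """Add multiview-specific keys expected by the data loader."""
    out = dict(sample)
    # Collect per-camera frame tensors in a fixed view order
    out["frames"] = [sample[f"video_{cam}"] for cam in camera_names]
    # One caption per camera view (empty string if missing)
    out["captions"] = [sample.get(f"caption_{cam}", "") for cam in camera_names]
    # Record the view ordering so the model can index camera embeddings
    out["camera_names"] = list(camera_names)
    out["n_views"] = len(camera_names)
    return out
```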
> Cosmos Transfer 2.5 Multiview extends the base Predict 2.5 Multiview model through a ControlNet architecture, which adds spatial conditioning capabilities while preserving the model's temporal and multi-view consistency.
>
> The architecture consists of two main components:
NIT [optional]: It would be nice to include a diagram if one is available. OK if it is already published in the paper. :)
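For readers unfamiliar with the ControlNet pattern the excerpt refers to, here is a minimal numerical sketch of the two-component idea: a frozen base branch plus a trainable control branch whose features are injected through zero-initialized projections. Class and variable names are invented for illustration; this is not the Cosmos Transfer 2.5 architecture code.

```python
import numpy as np

class ZeroLinear:
    """Projection initialized to zero: contributes nothing at step 0."""
    def __init__(self, dim):
        self.w = np.zeros((dim, dim))
    def __call__(self, x):
        return x @ self.w

class ControlNetSketch:
    """Toy two-branch model: frozen base + trainable control branch."""
    def __init__(self, dim, n_blocks=2):
        rng = np.random.default_rng(0)
        # Frozen base blocks (here: fixed random linear maps)
        self.base = [rng.standard_normal((dim, dim)) / dim**0.5
                     for _ in range(n_blocks)]
        # Trainable control blocks, initialized as copies of the base
        self.ctrl = [b.copy() for b in self.base]
        # Zero-initialized injection layers
        self.zero = [ZeroLinear(dim) for _ in range(n_blocks)]

    def forward(self, x, control):
        h, c = x, control
        for base_w, ctrl_w, zero in zip(self.base, self.ctrl, self.zero):
            c = c @ ctrl_w
            h = h @ base_w + zero(c)  # injection is a no-op until trained
        return h
```

The zero-initialized injections mean the combined model reproduces the base model exactly at the start of post-training, which is what makes ControlNet-style fine-tuning stable.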
> | **Model** | **Workload** | **Use Case** |
> |-----------|--------------|--------------|
> | Cosmos Transfer2 | Post-training | Spatially-conditioned multiview AV video generation with world scenario map control |
@tiffanycai6 I think it's Transfer 2.5 here.
> Software Dependencies:
>
> - Cosmos framework
@tiffanycai6, could you please hyperlink the Setup.md here?
> Pre-trained Models:
>
> - Cosmos Predict 2.5 Multiview (2B parameters)
It would be nice to link the download section of setup.md here.
> - Supports 7-view multi-camera configuration
> - 29-frame context with 8 latent frame state
These are under the Pre-trained Models heading; I don't think they belong in the pre-trained models field.
> - Lane detection accuracy on generated videos showed significant deviation from ground truth layouts
> - 3D cuboid evaluation revealed geometric inconsistencies in object placement across viewpoints
>
> These limitations confirmed the need for explicit spatial conditioning through visual control inputs rather than relying solely on text descriptions.
I think we were using FVD/FID and CSE/TSE for multiview eval; maybe you could point to them. They are located at https://github.com/nvidia-cosmos/cosmos-cookbook/tree/main/scripts/metrics
> World scenario maps project HD maps and dynamic objects in the scene onto the seven camera views as the control input (see the image below). The world scenario map includes:
>
> - **Map elements**: Lane lines (with fine-grained types such as dashed line, dotted line, double yellow line), poles, road boundaries, traffic lights with state, all directly rendered with appropriate colors and geometry patterns
> - **Dynamic 3D bounding boxes**: Indicate positions of vehicles and pedestrians with occlusion-aware and heading-aware representations
> - **Color coding**: Each object type is color-coded, with bounding boxes shaded according to the direction of motion, providing both semantic and motion cues
> - **Complex road topologies**: Can represent overpasses and intricate road structures
It would be nice to add an illustration of the HD map here.
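To make the "projected onto the seven camera views" step concrete, here is a generic sketch of projecting a heading-aware 3D bounding box into one camera with a pinhole model. The intrinsics, box parameters, and the simplified frame convention (z is depth, heading as rotation about z) are all assumptions for illustration, not the recipe's actual rendering code.

```python
import numpy as np

def box_corners(center, size, yaw):
    """8 corners of a 3D box with heading `yaw` (rotation about z), 3x8."""
    l, w, h = size
    x = np.array([ l,  l,  l,  l, -l, -l, -l, -l]) / 2
    y = np.array([ w, -w,  w, -w,  w, -w,  w, -w]) / 2
    z = np.array([ h,  h, -h, -h,  h,  h, -h, -h]) / 2
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return rot @ np.vstack([x, y, z]) + np.asarray(center, float).reshape(3, 1)

def project(points_cam, K):
    """Pinhole projection of 3xN camera-frame points to 2xN pixel coords."""
    uvw = K @ points_cam
    return uvw[:2] / uvw[2]
```

In a full renderer the projected corner polygons would then be drawn per view with the object-type color and motion-direction shading described above.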
> You must provide a folder containing a collection of videos in **MP4 format**, preferably 720p, as well as a corresponding folder containing a collection of the hdmap control input videos in **MP4 format**. The views for each sample should be further stratified into subdirectories named after the camera. We have an example dataset that can be used at `assets/multiview_hdmap_posttrain_dataset`.
>
> #### 1.2 Verify the dataset folder format
>
> Dataset folder format:
>
>     assets/multiview_hdmap_posttrain_dataset/
Could you please add some details on how the user can get data in these fields of view?
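A small helper like the following could sanity-check the pairing between sample videos and HD-map control videos per camera subdirectory. The folder names (`videos`, `hdmap`) and the camera list are assumptions extrapolated from the recipe text, not the actual example dataset layout.

```python
import os

def verify_dataset(root, cameras=("camera_front_wide",)):
    """Return a list of problems; empty list means the layout checks out.

    Assumed layout (hypothetical): root/videos/<camera>/*.mp4 paired
    one-to-one with root/hdmap/<camera>/*.mp4.
    """
    problems = []
    for split in ("videos", "hdmap"):
        if not os.path.isdir(os.path.join(root, split)):
            problems.append(f"missing folder: {split}")
    for cam in cameras:
        vdir = os.path.join(root, "videos", cam)
        mdir = os.path.join(root, "hdmap", cam)
        vids = set(os.listdir(vdir)) if os.path.isdir(vdir) else set()
        maps = set(os.listdir(mdir)) if os.path.isdir(mdir) else set()
        # symmetric difference = files present in only one of the two trees
        for name in sorted(vids ^ maps):
            problems.append(f"{cam}/{name} present in only one of videos/hdmap")
    return problems
```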
```python
transfer2_auto_multiview_post_train_example = dict(
    defaults=[
        f"/experiment/{DEFAULT_CHECKPOINT.experiment}",
        {"override /data_train": "example_multiview_train_data_control_input_hdmap"},
    ],
    job=dict(project="cosmos_transfer_v2p5", group="auto_multiview", name="2b_cosmos_multiview_post_train_example"),
    checkpoint=dict(
        save_iter=200,
        # pyrefly: ignore # missing-attribute
        load_path=get_checkpoint_path(DEFAULT_CHECKPOINT.s3.uri),
        load_training_state=False,
        strict_resume=False,
        load_from_object_store=dict(
            enabled=False,  # Loading from local filesystem, not S3
        ),
        save_to_object_store=dict(
            enabled=False,
        ),
    ),
    model=dict(
        config=dict(
            base_load_from=None,
        ),
    ),
    trainer=dict(
        logging_iter=100,
        max_iter=5_000,
        callbacks=dict(
            heart_beat=dict(
                save_s3=False,
            ),
            iter_speed=dict(
                hit_thres=200,
                save_s3=False,
            ),
            device_monitor=dict(
                save_s3=False,
            ),
            every_n_sample_reg=dict(
```
Based on the VDR feedback for post-training, we would need to add the dataloader config too.
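As a starting point for that, a dataloader override might take a shape like the following. Every key name here is a hypothetical sketch extrapolated from the experiment config above, not a verified Cosmos API; the actual field names should be taken from the data config referenced in `defaults`.

```python
# Hypothetical dataloader config fragment (names are assumptions, not a
# verified Cosmos API) mirroring the dict style of the experiment config.
dataloader_train = dict(
    dataset=dict(
        video_dir="assets/multiview_hdmap_posttrain_dataset/videos",
        control_dir="assets/multiview_hdmap_posttrain_dataset/hdmap",
        num_video_frames=29,  # matches the 29-frame context noted above
    ),
    batch_size=1,
    num_workers=4,
)
```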
```json
{
    "name": "auto_multiview",
    "prompt_path": "prompt.txt",
```
It would be nice to show what's in prompt.txt.
> We measure perceptual quality and temporal consistency using:
>
> - **FVD (Fréchet Video Distance)**: Evaluates visual quality and temporal coherence
> - **FID (Fréchet Inception Distance)**: Measures frame-level visual quality
> - **Temporal Sampson Error**: Assesses temporal consistency within each camera view
> - **Cross-Camera Sampson Error**: Measures geometric consistency across multiple camera views
These metrics are located at https://github.com/nvidia-cosmos/cosmos-cookbook/tree/main/scripts/metrics; it would be nice to add a hyperlink.
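For reference, the Sampson error that both the temporal and cross-camera metrics build on has a standard closed form; a sketch is below. The cookbook's metric scripts may differ in implementation details (normalization, aggregation), so treat this as the textbook formula rather than their exact code.

```python
import numpy as np

def sampson_error(F, x1, x2):
    """First-order geometric (Sampson) error per correspondence.

    F is a 3x3 fundamental matrix relating two views; x1 and x2 are 3xN
    matched points in homogeneous pixel coordinates.
    """
    Fx1 = F @ x1            # epipolar lines in image 2
    Ftx2 = F.T @ x2         # epipolar lines in image 1
    num = np.sum(x2 * Fx1, axis=0) ** 2          # (x2^T F x1)^2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den
```

A correspondence that exactly satisfies the epipolar constraint scores zero; deviation off the epipolar line increases the error, which is what makes it a proxy for temporal and cross-view geometric consistency.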
> <img width="675" height="166" alt="Screenshot 2025-12-05 at 4 46 04 PM" src="https://github.com/user-attachments/assets/f3f8d209-dd1a-49c0-a92f-0260f49eed34" />
>
> We observe a significant boost (up to 2.3x improvement) in FVD/FID scores while remaining competitive in temporal and cross-camera Sampson error. This indicates that the model generates higher-quality, more realistic videos without sacrificing temporal consistency or multi-view geometric coherence.
>
> **Spatial Accuracy Improvements**:
>
> <img width="789" height="145" alt="Screenshot 2025-12-05 at 4 47 15 PM" src="https://github.com/user-attachments/assets/ea4829cf-61e1-4cdf-aeb9-0446564783aa" />
>
> We observe a substantial improvement (up to 60%) in detection metrics compared to Transfer1-7B-Sample-AV. The enhanced performance on both lane detection and 3D cuboid detection tasks demonstrates that the model generates videos with significantly better adherence to the spatial control signals provided by world scenario maps.
>
> #### Qualitative Analysis
>
> Visual comparisons between Cosmos-Transfer1-7B-Sample-AV and Cosmos-Transfer2.5-2B/auto/multiview reveal significant quality improvements.
>
> <img width="828" height="634" alt="Screenshot 2025-12-05 at 4 48 24 PM" src="https://github.com/user-attachments/assets/c0332162-a38a-4bd0-bebf-5ab18ea25166" />
Please verify the asset paths.
Description
Add Cosmos Transfer2.5 Multiview posttraining recipe
Type of Change
Changes Made
Testing
Checklist
Additional Notes
Any additional information that reviewers should know.