diff --git a/docs/core_concepts/data_curation/overview.md b/docs/core_concepts/data_curation/overview.md index d8a66b5..6b5a8a0 100644 --- a/docs/core_concepts/data_curation/overview.md +++ b/docs/core_concepts/data_curation/overview.md @@ -1,7 +1,6 @@ # Data Curation -> **Authors:** [Jingyi Jin](https://www.linkedin.com/in/jingyi-jin) • [Alice Luo](https://www.linkedin.com/in/aliceluoqian) -> **Organization:** NVIDIA +> **Authors:** [Jingyi Jin](https://www.linkedin.com/in/jingyi-jin) • [Alice Luo](https://www.linkedin.com/in/aliceluoqian) • **Organization:** NVIDIA ## Overview @@ -30,12 +29,13 @@ Data curation is a complex, multi-stage process. As shown below, it systematical ![Comprehensive Data Curation Pipeline](images/data_curation_pipeline.png) -The **Cosmos video curation pipeline**—first established in *Cosmos-Predict1* and later scaled in *Cosmos-Predict2.5*—consists of seven stages: +The **Cosmos video curation pipeline**—first established in _Cosmos-Predict1_ and later scaled in _Cosmos-Predict2.5_—consists of seven stages: 1. **Shot-Aware Video Splitting** – Long-form videos are segmented into coherent clips using shot boundary detection. Short (<5 s) clips are discarded, while longer ones (5–60 s) form the basis for downstream curation. 2. **GPU-Based Transcoding** – Each clip is transcoded in parallel to optimize format, frame rate, and compression quality for model ingestion. 3. **Video Cropping** – Black borders, letterboxing, and spatial padding are removed to ensure consistent aspect ratios. 4. **Filtering** – A multi-stage filtering pipeline removes unsuitable data. Filters include: + - **Aesthetic Quality Filter** – Screens for poor composition or lighting. - **Motion Filter** – Removes clips with excessive or insufficient movement. - **OCR Filter** – Detects overlays, watermarks, or subtitles. 
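The stage-1 duration gate described above is simple enough to sketch. The helper below is illustrative only — the names `keep_clip` and `partition_clips` are not from the Cosmos codebase, and the handling of clips longer than 60 s (not specified above) is just reported as a separate bucket:

```python
# Illustrative sketch of the stage-1 duration gate: clips under 5 s are
# discarded, clips of 5-60 s are kept for downstream curation, and longer
# clips are bucketed separately (the pipeline's actual treatment of >60 s
# clips is not specified in this document).
MIN_CLIP_S, MAX_CLIP_S = 5.0, 60.0

def keep_clip(duration_s: float) -> bool:
    """True if a clip falls inside the curated 5-60 s window."""
    return MIN_CLIP_S <= duration_s <= MAX_CLIP_S

def partition_clips(durations_s):
    """Split shot-detected clip durations into kept, too-short, too-long."""
    kept = [d for d in durations_s if keep_clip(d)]
    too_short = [d for d in durations_s if d < MIN_CLIP_S]
    too_long = [d for d in durations_s if d > MAX_CLIP_S]
    return kept, too_short, too_long
```

In the real pipeline the durations would come from shot boundary detection; here they are plain numbers so the rule itself stays visible.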
@@ -56,13 +56,13 @@ The result is a dataset that is **clean, diverse, and semantically organized** ## From Pre-Training to Post-Training -Although *Cosmos-Predict2.5* operates at petabyte scale, its principles directly inform post-training data practices: +Although _Cosmos-Predict2.5_ operates at petabyte scale, its principles directly inform post-training data practices: - **Scale down, specialize up:** Post-training uses smaller but more domain-specific datasets. -- **Refine rather than expand:** Instead of collecting more data, focus on *improving alignment* and *removing noise*. +- **Refine rather than expand:** Instead of collecting more data, focus on _improving alignment_ and _removing noise_. - **Iterate via feedback loops:** Use model evaluation results to guide the next round of curation—closing the loop between data and learning outcomes. -In other words, post-training data curation inherits the *structure* of pre-training pipelines but applies it to **targeted, feedback-driven refinement**. +In other words, post-training data curation inherits the _structure_ of pre-training pipelines but applies it to **targeted, feedback-driven refinement**. 
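The feedback-loop principle can be made concrete with a toy sketch. Everything here is hypothetical — `evaluate` stands in for whatever post-training evaluation you run, and clip quality scores are assumed precomputed; the point is only the control flow: tighten a curation threshold while evaluation results keep improving, and stop when they don't.

```python
# Toy sketch of a feedback-driven curation loop (illustrative names only):
# raise the quality bar each round, re-evaluate, and keep the tighter
# threshold only if the evaluation score improved.
def curate(clips, quality_threshold):
    """Keep clips whose (precomputed) quality score clears the threshold."""
    return [c for c in clips if c["quality"] >= quality_threshold]

def feedback_loop(clips, evaluate, threshold=0.5, step=0.1, rounds=3):
    """Tighten the threshold while `evaluate` keeps improving."""
    best_score = evaluate(curate(clips, threshold))
    for _ in range(rounds):
        candidate = min(threshold + step, 1.0)
        score = evaluate(curate(clips, candidate))
        if score <= best_score:
            break  # tightening no longer helps; keep the previous threshold
        threshold, best_score = candidate, score
    return threshold, best_score
```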
--- @@ -75,33 +75,44 @@ Data sourcing involves acquiring datasets from diverse locations—internal stor ### Cloud Storage Tools -| Tool | Purpose | Best For | -|------|----------|----------| -| **s5cmd** | High-performance S3-compatible storage client | Large-scale parallel transfers | -| **AWS CLI** | Official AWS command-line tool | AWS-native workflows | -| **rclone** | Multi-cloud sync for 70+ providers | Complex multi-cloud setups | +| Tool | Purpose | Best For | +| ----------- | --------------------------------------------- | ------------------------------ | +| **s5cmd** | High-performance S3-compatible storage client | Large-scale parallel transfers | +| **AWS CLI** | Official AWS command-line tool | AWS-native workflows | +| **rclone** | Multi-cloud sync for 70+ providers | Complex multi-cloud setups | ### Web Content Tools -| Tool | Purpose | Best For | -|------|----------|----------| -| **HuggingFace CLI** | Access to model/dataset repositories | Community datasets and checkpoints | -| **yt-dlp** | High-throughput video downloader | Batch ingestion and quality selection | -| **wget/curl** | General-purpose file downloaders | API retrieval and recursive crawling | +| Tool | Purpose | Best For | +| ------------------- | ------------------------------------ | ------------------------------------- | +| **HuggingFace CLI** | Access to model/dataset repositories | Community datasets and checkpoints | +| **yt-dlp** | High-throughput video downloader | Batch ingestion and quality selection | +| **wget/curl** | General-purpose file downloaders | API retrieval and recursive crawling | + +### Physical AI Datasets + +For Physical AI developers working with Cosmos models, NVIDIA provides open, curated, and commercial-grade datasets in the **[NVIDIA Physical AI Collection](https://huggingface.co/collections/nvidia/physical-ai)** on Hugging Face, including: + +- Autonomous vehicle datasets (driving scenes, synthetic data, teleoperation) +- 
Robotics datasets (GR00T, manipulation, grasping, navigation) +- Smart spaces and warehouse datasets +- Domain-specific training and evaluation datasets + +These datasets are designed to work seamlessly with Cosmos models and can serve as starting points for domain-specific post-training workflows. ### Data Processing Tools -| Tool | Purpose | Best For | -|------|----------|----------| -| **ffmpeg** | Video transcoding and frame extraction | Reformatting and quality control | -| **PIL/Pillow** | Python imaging library | Lightweight image manipulation | +| Tool | Purpose | Best For | +| -------------- | -------------------------------------- | -------------------------------- | +| **ffmpeg** | Video transcoding and frame extraction | Reformatting and quality control | +| **PIL/Pillow** | Python imaging library | Lightweight image manipulation | ### Quality Control Tools -| Tool | Purpose | Best For | -|------|----------|----------| -| **OpenCV** | Computer vision toolkit | Visual inspection and analysis | -| **FFprobe** | Metadata extraction | Duration, codec, and resolution stats | +| Tool | Purpose | Best For | +| ----------- | ----------------------- | ------------------------------------- | +| **OpenCV** | Computer vision toolkit | Visual inspection and analysis | +| **FFprobe** | Metadata extraction | Duration, codec, and resolution stats | --- diff --git a/docs/faq.md b/docs/faq.md index da76317..8c0ecef 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -62,6 +62,17 @@ The Cosmos platform provides the following capabilities: **Post-training:** Cosmos WFMs are fully customizable to develop downstream vision, robotics or autonomous vehicle foundation models tailored for customer data. Post-training can be done to change output type, output quantity, output quality, output style or output point of view. +### Where can I find datasets for training Physical AI models? 
+ +NVIDIA provides curated, open, commercial-grade datasets for Physical AI development on the [NVIDIA Physical AI Collection](https://huggingface.co/collections/nvidia/physical-ai) on Hugging Face. This collection includes datasets for: + +- **Autonomous vehicles**: Driving scenes, synthetic data, and teleoperation datasets +- **Robotics**: GR00T, manipulation, grasping, and navigation datasets +- **Smart spaces and warehouses**: Multi-camera tracking, detection, and spatial intelligence datasets +- **Domain-specific training and evaluation**: Specialized datasets for various Physical AI applications + +These datasets are designed to work seamlessly with Cosmos models and can serve as starting points for domain-specific post-training workflows. + ### How do Cosmos models differ from other video foundation models? Cosmos world foundation models are designed specifically for physical AI applications. The models are openly available and customizable, with Cosmos Predict and Cosmos Reason supporting post-training for autonomous vehicle, robotics, and vision-action generation models. @@ -279,6 +290,7 @@ Existing NVIDIA Omniverse Enterprise (NVOVE) licenses can be used for Cosmos ent - **Documentation**: Comprehensive guides in each repository - **Examples**: Reference implementations and tutorials - **Community Forums**: Engage with other developers +- **Physical AI Datasets**: Access curated datasets for autonomous vehicles, robotics, smart spaces, and warehouse environments on the [NVIDIA Physical AI Collection](https://huggingface.co/collections/nvidia/physical-ai) on Hugging Face #### Official Channels diff --git a/docs/index.md b/docs/index.md index 2fa8b38..a35abcb 100644 --- a/docs/index.md +++ b/docs/index.md @@ -25,6 +25,8 @@ The Cosmos Cookbook is an open-source resource where NVIDIA and the broader Phys We welcome contributions—from new examples and workflow improvements to bug fixes and documentation updates. 
Together, we can evolve best practices and accelerate the adoption of Cosmos models across domains. +**📊 Physical AI Datasets:** Access curated datasets for autonomous vehicles, intelligent transportation systems, robotics, smart spaces, and warehouse environments on the [NVIDIA Physical AI Collection](https://huggingface.co/collections/nvidia/physical-ai) on Hugging Face. + ## Case Study Recipes The cookbook includes comprehensive use cases demonstrating real-world applications across the Cosmos platform. @@ -60,18 +62,18 @@ The cookbook includes comprehensive use cases demonstrating real-world applicati #### Vision-language reasoning and quality control -| **Workflow** | **Description** | **Link** | -|--------------|-----------------|----------| -| **Training** | Physical plausibility check for video quality assessment | [Video Rewards](recipes/post_training/reason1/physical-plausibility-check/post_training.md) | -| **Training** | Spatial AI understanding for warehouse environments | [Spatial AI Warehouse](recipes/post_training/reason1/spatial-ai-warehouse/post_training.md) | -| **Training** | Intelligent transportation scene understanding and analysis | [Intelligent Transportation](recipes/post_training/reason1/intelligent-transportation/post_training.md) | -| **Training** | AV video captioning and visual question answering for autonomous vehicles | [AV Video Caption VQA](recipes/post_training/reason1/av_video_caption_vqa/post_training.md) | -| **Training** | Temporal localization for MimicGen robot learning data generation | [Temporal Localization](recipes/post_training/reason1/temporal_localization/post_training.md) | +| **Workflow** | **Description** | **Link** | +| ------------ | ------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- | +| **Training** | Physical plausibility check for video quality assessment | [Video 
Rewards](recipes/post_training/reason1/physical-plausibility-check/post_training.md) | +| **Training** | Spatial AI understanding for warehouse environments | [Spatial AI Warehouse](recipes/post_training/reason1/spatial-ai-warehouse/post_training.md) | +| **Training** | Intelligent transportation scene understanding and analysis | [Intelligent Transportation](recipes/post_training/reason1/intelligent-transportation/post_training.md) | +| **Training** | AV video captioning and visual question answering for autonomous vehicles | [AV Video Caption VQA](recipes/post_training/reason1/av_video_caption_vqa/post_training.md) | +| **Training** | Temporal localization for MimicGen robot learning data generation | [Temporal Localization](recipes/post_training/reason1/temporal_localization/post_training.md) | ### **Cosmos Curator** -| **Workflow** | **Description** | **Link** | -|--------------|-----------------|----------| +| **Workflow** | **Description** | **Link** | +| ------------ | ---------------------------------------------------- | ------------------------------------------------------------------------------- | | **Curation** | Curate video data for Cosmos Predict 2 post-training | [Predict 2 Data Curation](recipes/data_curation/predict2_data/data_curation.md) | ### **End-to-End Workflows** @@ -124,6 +126,7 @@ Visual examples of Cosmos Transfer results across Physical AI domains: This cookbook provides flexible entry points for both **inference** and **training** workflows. Each section contains runnable scripts, technical recipes, and complete examples. 
- **Inference workflows:** [Getting Started](getting_started/setup.md) for setup and immediate model deployment +- **Physical AI datasets:** [NVIDIA Physical AI Collection](https://huggingface.co/collections/nvidia/physical-ai) on Hugging Face for curated datasets across domains - **Data processing:** [Data Processing & Analysis](core_concepts/data_curation/overview.md) for content analysis workflows - **Training workflows:** [Model Training & Fine-tuning](core_concepts/post_training/overview.md) for domain adaptation - **Case study recipes:** [Case Study Recipes](#case-study-recipes) organized by application area diff --git a/docs/recipes/inference/transfer1/inference-warehouse-mv/inference.md b/docs/recipes/inference/transfer1/inference-warehouse-mv/inference.md index bbbf2fe..26e1ef3 100644 --- a/docs/recipes/inference/transfer1/inference-warehouse-mv/inference.md +++ b/docs/recipes/inference/transfer1/inference-warehouse-mv/inference.md @@ -1,13 +1,12 @@ # Cosmos Transfer 1 Sim2Real for Multi-View Warehouse Detection and Tracking -> **Authors:** [Alice Li](https://www.linkedin.com/in/alice-li-17439713b/) • [Thomas Tang](https://www.linkedin.com/in/zhengthomastang/) • [Yuxing Wang](https://www.linkedin.com/in/yuxing-wang-55394620b/) • [Jingyi Jin](https://www.linkedin.com/in/jingyi-jin) -> **Organization:** NVIDIA +> **Authors:** [Alice Li](https://www.linkedin.com/in/alice-li-17439713b/) • [Thomas Tang](https://www.linkedin.com/in/zhengthomastang/) • [Yuxing Wang](https://www.linkedin.com/in/yuxing-wang-55394620b/) • [Jingyi Jin](https://www.linkedin.com/in/jingyi-jin) • **Organization:** NVIDIA -| Model | Workload | Use case | -|------|----------|----------| +| Model | Workload | Use case | +| ----------------- | --------- | ----------------------------- | | Cosmos Transfer 1 | Inference | Sim to Real data augmentation | -This use case demonstrates how to apply Cosmos Transfer 1 for data augmentation over Omniverse (OV) generated synthetic data to close the 
sim-to-real domain gap, specifically targeting multi-view warehouse detection and tracking scenarios. +This use case demonstrates how to apply Cosmos Transfer 1 for data augmentation over Omniverse (OV) generated synthetic data to close the sim-to-real domain gap, specifically targeting multi-view warehouse detection and tracking scenarios. - [Setup and System Requirement](setup.md) @@ -22,7 +21,7 @@ This use case explores how first time outside-in multi-view world simulation can Monitoring of warehouse spaces typically involves multi-camera views to provide comprehensive coverage. Since Cosmos Transfer 1 does not natively support multi-view processing, we adopt an approach to ensure visual consistency across all camera viewpoints: 1. **Multi-view Outside-In Data Generation**: Multi-view synthetic videos and corresponding multi-modal -ground truth data (e.g., depth, segmentation masks) are prepared by [IsaacSim.Replicator.Agent](https://docs.isaacsim.omniverse.nvidia.com/latest/index.html). + ground truth data (e.g., depth, segmentation masks) are prepared by [IsaacSim.Replicator.Agent](https://docs.isaacsim.omniverse.nvidia.com/latest/index.html). 2. **Processing**: For each video, identical text prompts and parameter settings are provided to the Cosmos Transfer 1 model, ensuring uniformity across different camera views. Modalities are carefully chosen and analyzed to enhance object feature consistency across different camera views. Detailed, data-driven text prompts are employed to minimize divergence in object features between views. In the following case, only depth and edge maps (0.5 depth + 0.5 edge) are used as input controls to the Cosmos Transfer 1 model. This approach guarantees that all camera views receive consistent environmental transformations while maintaining spatial and temporal coherence across the multi-view setup. 
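The parameter-consistency step above amounts to broadcasting one prompt and one control configuration (0.5 depth + 0.5 edge) across every camera. A minimal sketch — the function name and `rgb.mp4`/`depth.mp4` file layout are hypothetical, but the dictionary shape mirrors the example configs shown later on this page:

```python
# Build one Cosmos Transfer 1 style config per camera view. The prompt and
# control weights are shared verbatim across views so that every camera
# receives the same environmental transformation; only the paths differ.
def make_view_configs(prompt, view_dirs):
    configs = []
    for view in view_dirs:
        configs.append({
            "prompt": prompt,  # identical across views for visual consistency
            "input_video_path": f"{view}/rgb.mp4",
            "edge": {"control_weight": 0.5},
            "depth": {"control_weight": 0.5,
                      "input_control": f"{view}/depth.mp4"},
        })
    return configs
```

Serializing each entry with `json.dump` yields one inference spec per view; because the prompt and weights are shared, the only per-view difference is the pair of video paths.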
@@ -53,6 +52,10 @@ The dataset provides a 6-camera warehouse setup with synchronized data organized - **`rgb.mp4`**: RGB video data for the camera view - **`depth.mp4`**: Corresponding depth video data for the camera view +### Additional Physical AI Smart Spaces Datasets + +For additional multi-camera warehouse datasets, see the [NVIDIA PhysicalAI-SmartSpaces dataset](https://huggingface.co/datasets/nvidia/PhysicalAI-SmartSpaces) on Hugging Face, which includes over 250 hours of synchronized multi-camera video with 2D/3D annotations, depth maps, and calibration data. + ### Warehouse Outside-In Multi-View Input The RGB video for each camera is processed sequentially through multiple Cosmos Transfer 1 inference runs. We present the concatenated multi-view videos below to demonstrate the combined perspectives. @@ -79,15 +82,15 @@ We can leverage the Cosmos Transfer 1 model to convert the appearance of synthet ```json { - "prompt": "The camera provides a clear view of the warehouse interior, showing rows of shelves stacked with boxes, baskets and other objects. There are workers and robots walking around, moving boxes or operating machinery. The lighting is bright and even, with overhead fluorescent lights illuminating the space. The floor is clean and well-maintained, with clear pathways between the shelves. The atmosphere is busy but organized, with workers and humanoids moving efficiently around the warehouse.", - "input_video_path": "/your_video_path/rgb_video.mp4", - "edge": { - "control_weight": 0.5 - }, - "depth": { - "control_weight": 0.5, - "input_control": "/your_video_path/depth_video.mp4" - } + "prompt": "The camera provides a clear view of the warehouse interior, showing rows of shelves stacked with boxes, baskets and other objects. There are workers and robots walking around, moving boxes or operating machinery. The lighting is bright and even, with overhead fluorescent lights illuminating the space. 
The floor is clean and well-maintained, with clear pathways between the shelves. The atmosphere is busy but organized, with workers and humanoids moving efficiently around the warehouse.", +  "input_video_path": "/your_video_path/rgb_video.mp4", +  "edge": { +    "control_weight": 0.5 +  }, +  "depth": { +    "control_weight": 0.5, +    "input_control": "/your_video_path/depth_video.mp4" +  } } ``` @@ -108,12 +111,12 @@ Example control and text prompts: ```json { -"prompt": "The camera provides a photorealistic view of a dimly lit warehouse shrouded in thick fog, where shelves emerge like silhouettes through the mist, holding crates covered in dew. Workers wear diversed layered clothing in muted tones, including utility jackets, waterproof pants, and sturdy boots, some paired with scarves and beanies to counter the chill. Humanoid robots with matte black finishes and faintly glowing outlines navigate through the haze, their movements slow and deliberate. Forklifts, painted in industrial gray with fog lights attached, glide silently across the damp concrete floor. The lighting is eerie and diffused, with beams from overhead fixtures piercing the mist to create dramatic light shafts. The atmosphere is mysterious and quiet, with the muffled sound of machinery barely audible through the thick air.", -"prompt_title": "The camera provides a photorealistic", -"input_video_path": "/your_video_path/rgb_video.mp4", -"edge": { +  "prompt": "The camera provides a photorealistic view of a dimly lit warehouse shrouded in thick fog, where shelves emerge like silhouettes through the mist, holding crates covered in dew. Workers wear diverse, layered clothing in muted tones, including utility jackets, waterproof pants, and sturdy boots, some paired with scarves and beanies to counter the chill. Humanoid robots with matte black finishes and faintly glowing outlines navigate through the haze, their movements slow and deliberate. 
Forklifts, painted in industrial gray with fog lights attached, glide silently across the damp concrete floor. The lighting is eerie and diffused, with beams from overhead fixtures piercing the mist to create dramatic light shafts. The atmosphere is mysterious and quiet, with the muffled sound of machinery barely audible through the thick air.", +  "prompt_title": "The camera provides a photorealistic", +  "input_video_path": "/your_video_path/rgb_video.mp4", +  "edge": { "control_weight": 1.0 -  } +  } } ``` @@ -133,26 +136,26 @@ Diversity can be further enhanced by dividing each camera view video into 10 segments ### Recommended Control Configuration -Similar to the weather augmentation approach, experiments show that controlling only for *edge and depth* produces the best results for warehouse sim-to-real conversion. This configuration maintains structural consistency while allowing realistic appearance changes. +Similar to the weather augmentation approach, experiments show that controlling only for _edge and depth_ produces the best results for warehouse sim-to-real conversion. This configuration maintains structural consistency while allowing realistic appearance changes. ```json - { +{ - "prompt": "The camera provides a clear view of the warehouse interior, showing rows of shelves stacked with boxes, baskets and other objects. There are workers and robots walking around, moving boxes or operating machinery. The lighting is bright and even, with overhead fluorescent lights illuminating the space. 
The atmosphere is busy but organized, with workers and humanoids moving efficiently around the warehouse.", - "input_video_path": "/mnt/pvc/gradio/uploads/upload_20250916_152159/rgb.mp4", - "edge": { - "control_weight": 0.5 - }, - "depth": { - "control_weight": 0.5, - "input_control": "/mnt/pvc/gradio/uploads/upload_20250916_152159/depth.mp4" - } + "prompt": "The camera provides a clear view of the warehouse interior, showing rows of shelves stacked with boxes, baskets and other objects. There are workers and robots walking around, moving boxes or operating machinery. The lighting is bright and even, with overhead fluorescent lights illuminating the space. The floor is clean and well-maintained, with clear pathways between the shelves. The atmosphere is busy but organized, with workers and humanoids moving efficiently around the warehouse.", + "input_video_path": "/mnt/pvc/gradio/uploads/upload_20250916_152159/rgb.mp4", + "edge": { + "control_weight": 0.5 + }, + "depth": { + "control_weight": 0.5, + "input_control": "/mnt/pvc/gradio/uploads/upload_20250916_152159/depth.mp4" + } } ``` ## 2D Detection Results on Augmented Dataset -To evaluate the effectiveness of Cosmos Transfer 1 for data augmentation, experiments were conducted using carefully selected multi-view scenes from the AI City Challenge dataset. *Eleven distinct scenes* were picked from the [AI City v0.1](https://www.aicitychallenge.org/) dataset, representing diverse warehouse and indoor environments. +To evaluate the effectiveness of Cosmos Transfer 1 for data augmentation, experiments were conducted using carefully selected multi-view scenes from the AI City Challenge dataset. _Eleven distinct scenes_ were picked from the [AI City v0.1](https://www.aicitychallenge.org/) dataset, representing diverse warehouse and indoor environments. Each of these 11 baseline scenes was processed through the Cosmos Transfer 1 augmentation pipeline using the multi-view parameter-consistent approach described earlier. 
This process generated ambient variations, lighting changes, and environmental conditions (including dust and reduced visibility scenarios), while maintaining structural consistency and multi-view coherence across all camera viewpoints. @@ -160,11 +162,11 @@ The resulting augmented dataset, containing both original and Cosmos Transfer-en ### Detection Performance Results -| Dataset Configuration | Pretrained Checkpoint | Building K Person AP50 | Building K Nova Carter AP50 | Building K mAP50 | -|----------------------|----------------------|------------------------|----------------------------|------------------| -| **Baseline: 1-min IsaacSim AICity v0.1** | NVImageNetV2 backbone | 0.776 | 0.478 | 0.627 | -| **1-min Cosmos AICity v0.1** | NVImageNetV2 backbone | 0.827 (+6.17%) | 0.545 (+12.35%) | 0.686 (+8.60%) | -| **1-min IsaacSim + 1-min Cosmos AICity v0.1** | NVImageNetV2 backbone | 0.838 (+7.40%) | 0.645 (+25.94%) | 0.742 (+15.50%) | +| Dataset Configuration | Pretrained Checkpoint | Building K Person AP50 | Building K Nova Carter AP50 | Building K mAP50 | +| --------------------------------------------- | --------------------- | ---------------------- | --------------------------- | ---------------- | +| **Baseline: 1-min IsaacSim AICity v0.1** | NVImageNetV2 backbone | 0.776 | 0.478 | 0.627 | +| **1-min Cosmos AICity v0.1** | NVImageNetV2 backbone | 0.827 (+6.17%) | 0.545 (+12.35%) | 0.686 (+8.60%) | +| **1-min IsaacSim + 1-min Cosmos AICity v0.1** | NVImageNetV2 backbone | 0.838 (+7.40%) | 0.645 (+25.94%) | 0.742 (+15.50%) | ## Conclusion