CubeVLA

Course Project for COMS 4733 Computational Aspects of Robotics, Fall 2025, Columbia University.

Overview

CubeVLA is a hierarchical framework for long-horizon cube manipulation and arrangement using VLM-guided planning. It uses Qwen-2.5-VL-3B-Instruct as the high-level planner and a fine-tuned $\pi_{0.5}$ model as the low-level actor. The system is validated on the ManipulationNet benchmark within the Genesis simulator.
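The planner/actor split described above can be illustrated with a minimal control-loop sketch. Everything in it is hypothetical: `HierarchicalController`, `toy_planner`, and `toy_actor` are illustrative stand-ins rather than CubeVLA's actual classes, and the `"done"` stop signal, the 8-dimensional actions, and the one-chunk-per-subtask simplification are assumptions. In the real system the planner role would be played by Qwen-2.5-VL and the actor role by the fine-tuned $\pi_{0.5}$ model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

# Hypothetical signatures, chosen for illustration only.
Planner = Callable[[Any, str, List[str]], str]   # (image, goal, history) -> next subtask
Actor = Callable[[Any, str], List[List[float]]]  # (image, subtask) -> chunk of actions


@dataclass
class HierarchicalController:
    plan: Planner
    act: Actor
    history: List[str] = field(default_factory=list)

    def run(self, observe: Callable[[], Any], goal: str,
            max_subtasks: int = 10) -> List[Tuple[str, int]]:
        """Alternate between planning a subtask and executing one action chunk."""
        executed = []
        for _ in range(max_subtasks):
            image = observe()
            subtask = self.plan(image, goal, self.history)
            if subtask == "done":  # assumed stop signal from the planner
                break
            actions = self.act(image, subtask)
            executed.append((subtask, len(actions)))
            self.history.append(subtask)
        return executed


def toy_planner(image, goal, history):
    # Stand-in for the VLM: replays a fixed two-step plan.
    steps = ["pick up the red cube", "place it on the blue cube"]
    return steps[len(history)] if len(history) < len(steps) else "done"


def toy_actor(image, subtask):
    # Stand-in for the policy: a chunk of 16 zero actions, 8 dimensions each (arbitrary).
    return [[0.0] * 8 for _ in range(16)]


controller = HierarchicalController(plan=toy_planner, act=toy_actor)
result = controller.run(observe=lambda: None, goal="stack the red cube on the blue cube")
print(result)
```

Passing the planner and actor in as plain callables keeps the loop independent of any particular model, so either side can be swapped for a stub when testing.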


Installation

Data generation and inference scripts are tested on Linux (with CUDA), macOS (with Metal), and Google Colab. Fine-tuning scripts are tested on Linux (with CUDA) and Google Colab.

A custom openpi package is included as a submodule and is required for running the project on macOS and Google Colab.

When cloning this repo, make sure to update submodules:

git clone --recurse-submodules git@github.com:Soohti/CubeVLA.git

# Or if you already cloned the repo:
git submodule update --init --recursive

Run the following to set up the environment:

GIT_LFS_SKIP_SMUDGE=1 uv sync

Environment Variables

Create a .env file in the root directory with the following content:

OPENPI_DATA_HOME=./pi05_droid
CONFIG_NAME=pi05_droid
MODEL_DOWNLOAD_URL=gs://openpi-assets/checkpoints/pi05_droid
MODEL_CHECKPOINT_PATH=./pi05_droid/openpi-assets/checkpoints/pi05_droid

DATASET_COPY_WORKERS=10
REPO_ID=soohti/droid-ft

QWEN_ID=Qwen/Qwen2.5-VL-3B-Instruct
QWEN_PATH=./qwen2.5
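As a sanity check on these values, the sketch below parses a few of the entries and shows that `MODEL_CHECKPOINT_PATH` is consistent with mirroring the `gs://` URL under `OPENPI_DATA_HOME` as `<home>/<bucket>/<path>`. That layout rule is inferred from the example values above, not confirmed from the project code, and `parse_env` and `expected_checkpoint_path` are hypothetical helpers; the project's scripts may load the .env file differently.

```python
from pathlib import Path
from urllib.parse import urlparse

# A subset of the .env content shown above.
ENV_TEXT = """\
OPENPI_DATA_HOME=./pi05_droid
CONFIG_NAME=pi05_droid
MODEL_DOWNLOAD_URL=gs://openpi-assets/checkpoints/pi05_droid
MODEL_CHECKPOINT_PATH=./pi05_droid/openpi-assets/checkpoints/pi05_droid
"""


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def expected_checkpoint_path(env: dict) -> Path:
    """Assumed layout: mirror the gs:// URL under OPENPI_DATA_HOME."""
    url = urlparse(env["MODEL_DOWNLOAD_URL"])
    return Path(env["OPENPI_DATA_HOME"]) / url.netloc / url.path.strip("/")


env = parse_env(ENV_TEXT)
derived = expected_checkpoint_path(env)
print(derived)
print(derived == Path(env["MODEL_CHECKPOINT_PATH"]))
```

If `OPENPI_DATA_HOME` or `MODEL_DOWNLOAD_URL` is changed, `MODEL_CHECKPOINT_PATH` likely needs to be updated to match.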

Scripts

  • download.py: Download $\pi_{0.5}$-DROID and Qwen-2.5-VL-3B-Instruct base model checkpoints.
  • run_base.py: Run inference using the $\pi_{0.5}$-DROID base model only. Does not support hierarchical planning.
  • data_generation.py: Generate training data using the simulator.
  • run_predict.py: Run CubeVLA inference using the (fine-tuned) Qwen-2.5-VL and $\pi_{0.5}$ models.

Fine-tuning $\pi_{0.5}$-DROID

The following TrainConfig is used for fine-tuning:

TrainConfig(
    name="pi05_droid_ft",
    model=pi0_config.Pi0Config(pi05=True, action_horizon=16, discrete_state_input=False),
    data=LeRobotDROIDDataConfig(
        repo_id="soohti/droid-ft",
        base_config=DataConfig(prompt_from_task=True),
        # extra_delta_transform=True,
        assets=AssetsConfig(
            # Important: reuse the original DROID norm stats during fine-tuning!
            assets_dir="pi05_droid/assets",
            asset_id="droid",
        ),
    ),
    batch_size=64,
    lr_schedule=_optimizer.CosineDecaySchedule(
        # warmup_steps=10_000,
        warmup_steps=100,
        peak_lr=5e-5,
        # decay_steps=1_000_000,
        decay_steps=200,
        decay_lr=5e-5,
    ),
    optimizer=_optimizer.AdamW(clip_gradient_norm=1.0),
    weight_loader=weight_loaders.CheckpointWeightLoader("pi05_droid/params"),
    # num_train_steps=30_000,
    num_train_steps=5000,
    save_interval=1000,
    log_interval=50,
    fsdp_devices=4
)
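One detail of this configuration is worth spelling out: `peak_lr` and `decay_lr` are both `5e-5`, so under a standard warmup-plus-cosine schedule the learning rate stays flat once warmup ends. The sketch below is a hypothetical pure-Python re-implementation (`warmup_cosine_lr` is not an openpi function). It assumes that `decay_steps` counts from step 0 and includes the warmup, and that warmup starts from zero; the actual schedule may start from a small nonzero value.

```python
import math


def warmup_cosine_lr(step: int, warmup_steps: int = 100, peak_lr: float = 5e-5,
                     decay_steps: int = 200, decay_lr: float = 5e-5) -> float:
    """Linear warmup to peak_lr, then cosine decay to decay_lr by decay_steps."""
    if step < warmup_steps:
        # Linear ramp from 0 toward peak_lr (assumed starting point).
        return peak_lr * step / warmup_steps
    # Fraction of the cosine phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, decay_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return decay_lr + (peak_lr - decay_lr) * cosine


for step in (0, 50, 100, 150, 200, 4999):
    print(step, warmup_cosine_lr(step))
```

With the values from the config, every step from 100 onward evaluates to `5e-5`, so `decay_steps=200` has no practical effect on the remaining training steps.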

Fine-tuning Qwen-2.5-VL

Fine-tuning Qwen-2.5-VL requires a different version of transformers from the one used by the rest of the project. Enter the Qwen3-VL/qwen-vl-finetune directory and set up its environment there. The tools/convert_droid_ft.py script converts the dataset for fine-tuning. The following command is used for fine-tuning:

IFS=',' read -ra DEV_ARR <<<"${CUDA_VISIBLE_DEVICES}"
NPROC_PER_NODE=${#DEV_ARR[@]}
MASTER_PORT=${MASTER_PORT:-29502}
PIXELS=$((224*224))

torchrun --nproc_per_node=${NPROC_PER_NODE} --master_port=${MASTER_PORT} \
  $ROOT/qwenvl/train/train_qwen.py \
  --model_name_or_path "$MODEL_PATH" \
  --dataset_use droid_ft%100 \
  --output_dir "$OUTPUT_DIR" \
  --cache_dir "$CACHE_DIR" \
  --bf16 \
  --model_max_length 4096 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --weight_decay 0.01 \
  --logging_steps 10 \
  --save_steps 1000 \
  --save_total_limit 3 \
  --data_flatten True \
  --data_packing False \
  --max_pixels ${PIXELS} \
  --min_pixels ${PIXELS} \
  --tune_mm_llm True \
  --tune_mm_mlp False \
  --tune_mm_vision False \
  --lora_enable False \
  --optim adamw_torch \
  --gradient_checkpointing True \
  --dataloader_num_workers 4 \
  --report_to none
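To make the numbers in this command concrete, the sketch below works out the effective batch size per optimizer step and the image pixel budget. The helper names are hypothetical, and the example device list `"0,1,2,3"` is an assumption. The 28x28 pixels-per-visual-token figure is likewise an assumption about the Qwen-VL vision encoder (14-pixel patches merged 2x2) and should be checked against the model documentation.

```python
def count_devices(cuda_visible_devices: str) -> int:
    """Count comma-separated device IDs, like the IFS=',' read in the shell snippet."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def effective_batch_size(num_devices: int, per_device: int = 4, grad_accum: int = 4) -> int:
    """Samples contributing to each optimizer step under data parallelism."""
    return num_devices * per_device * grad_accum


PIXELS = 224 * 224          # same value as the PIXELS shell variable
PIXELS_PER_TOKEN = 28 * 28  # assumption: 14 px patches merged 2x2

n = count_devices("0,1,2,3")  # example value for CUDA_VISIBLE_DEVICES
print("devices:", n)
print("effective batch size:", effective_batch_size(n))
print("approx. visual tokens per image:", PIXELS // PIXELS_PER_TOKEN)
```

Because `--max_pixels` and `--min_pixels` are set to the same value, every image is resized to roughly the same pixel budget, which keeps the number of visual tokens per image approximately constant.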
