Course Project for COMS 4733 Computational Aspects of Robotics, Fall 2025, Columbia University.
CubeVLA is a framework for long-horizon cube manipulation and arrangement using VLM-guided planning. It is a hierarchical framework that uses Qwen-2.5-VL-3B-Instruct as a high-level planner and a fine-tuned $\pi_{0.5}$-DROID model as a low-level action policy.
Data generation and inference scripts are tested on Linux (with CUDA), macOS (with Metal), and Google Colab. Fine-tuning scripts are tested on Linux (with CUDA) and Google Colab.
A custom openpi package is included as a submodule and is required for running the project on macOS and Google Colab.
When cloning this repo, make sure to update submodules:
```bash
git clone --recurse-submodules git@github.com:Soohti/CubeVLA.git
# Or if you already cloned the repo:
git submodule update --init --recursive
```

Run the following to set up the environment:
```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
```

Create a `.env` file in the root directory with the following content:

```
OPENPI_DATA_HOME=./pi05_droid
CONFIG_NAME=pi05_droid
MODEL_DOWNLOAD_URL=gs://openpi-assets/checkpoints/pi05_droid
MODEL_CHECKPOINT_PATH=./pi05_droid/openpi-assets/checkpoints/pi05_droid
DATASET_COPY_WORKERS=10
REPO_ID=soohti/droid-ft
QWEN_ID=Qwen/Qwen2.5-VL-3B-Instruct
QWEN_PATH=./qwen2.5-
```
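The project's scripts read these values from `.env`; if you also want them in your interactive shell (for ad-hoc commands), one standard bash pattern is:

```bash
set -a        # auto-export every variable assigned while this is active
source .env   # read the KEY=VALUE pairs into the environment
set +a        # stop auto-exporting
echo "$CONFIG_NAME"   # sanity check: should print pi05_droid
```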
- `download.py`: Download the $\pi_{0.5}$-DROID and Qwen-2.5-VL-3B-Instruct base model checkpoints.
- `run_base.py`: Run inference using the $\pi_{0.5}$-DROID base model only. Does not support hierarchical planning.
- `data_generation.py`: Generate training data using the simulator.
- `run_predict.py`: Run CubeVLA inference using the (fine-tuned) Qwen-2.5-VL and $\pi_{0.5}$ models.
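A typical end-to-end run chains these scripts roughly as follows; this is a sketch, and any per-script flags (not shown here) should be checked with each script's `--help`:

```bash
uv run download.py         # fetch the pi0.5-DROID and Qwen-2.5-VL-3B-Instruct checkpoints
uv run data_generation.py  # roll out the simulator to produce training data
uv run run_predict.py      # hierarchical inference: Qwen plans, pi0.5 acts
```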
The following `TrainConfig` is used for fine-tuning $\pi_{0.5}$:
```python
TrainConfig(
    name="pi05_droid_ft",
    model=pi0_config.Pi0Config(pi05=True, action_horizon=16, discrete_state_input=False),
    data=LeRobotDROIDDataConfig(
        repo_id="soohti/droid-ft",
        base_config=DataConfig(prompt_from_task=True),
        # extra_delta_transform=True,
        assets=AssetsConfig(
            # Important: reuse the original DROID norm stats during fine-tuning!
            assets_dir="pi05_droid/assets",
            asset_id="droid",
        ),
    ),
    batch_size=64,
    lr_schedule=_optimizer.CosineDecaySchedule(
        # warmup_steps=10_000,
        warmup_steps=100,
        peak_lr=5e-5,
        # decay_steps=1_000_000,
        decay_steps=200,
        decay_lr=5e-5,
    ),
    optimizer=_optimizer.AdamW(clip_gradient_norm=1.0),
    weight_loader=weight_loaders.CheckpointWeightLoader("pi05_droid/params"),
    # num_train_steps=30_000,
    num_train_steps=5000,
    save_interval=1000,
    log_interval=50,
    fsdp_devices=4,
)
```
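With this config registered as `pi05_droid_ft`, training can be launched through openpi's standard entry point. A minimal sketch, assuming the bundled openpi submodule keeps the upstream `scripts/train.py` interface and using an arbitrary experiment name:

```bash
# Let JAX use up to 90% of GPU memory (openpi's recommended setting).
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_droid_ft \
  --exp-name=cubevla-pi05-ft --overwrite
```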
Fine-tuning the Qwen planner requires a different version of `transformers`: enter the `Qwen3-VL/qwen-vl-finetune` directory and set up its environment there. The `tools/convert_droid_ft.py` script converts the dataset into the format expected for fine-tuning. The following command is used for fine-tuning:

```bash
# Launch one process per visible GPU.
IFS=',' read -ra DEV_ARR <<<"${CUDA_VISIBLE_DEVICES}"
NPROC_PER_NODE=${#DEV_ARR[@]}
MASTER_PORT=${MASTER_PORT:-29502}
PIXELS=$((224*224))
torchrun --nproc_per_node=${NPROC_PER_NODE} --master_port=${MASTER_PORT} \
$ROOT/qwenvl/train/train_qwen.py \
--model_name_or_path "$MODEL_PATH" \
--dataset_use droid_ft%100 \
--output_dir "$OUTPUT_DIR" \
--cache_dir "$CACHE_DIR" \
--bf16 \
--model_max_length 4096 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-5 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--warmup_ratio 0.03 \
--weight_decay 0.01 \
--logging_steps 10 \
--save_steps 1000 \
--save_total_limit 3 \
--data_flatten True \
--data_packing False \
--max_pixels ${PIXELS} \
--min_pixels ${PIXELS} \
--tune_mm_llm True \
--tune_mm_mlp False \
--tune_mm_vision False \
--lora_enable False \
--optim adamw_torch \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--report_to none
```
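For reference, the effective optimizer batch size is `NPROC_PER_NODE` × `per_device_train_batch_size` × `gradient_accumulation_steps`; with 4 GPUs that is 4 × 4 × 4 = 64 sequences per step, matching the `batch_size=64` used for the $\pi_{0.5}$ fine-tune. Setting `--max_pixels` and `--min_pixels` to the same value (224 × 224 = 50176) forces every image to a fixed pixel budget.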