Course Project for COMS 4733 Computational Aspects of Robotics, Fall 2025, Columbia University.
CubeVLA is a framework for long-horizon cube manipulation and arrangement using VLM-guided planning. It is a hierarchical framework that uses Qwen-2.5-VL-3B-Instruct as a high-level planner and a fine-tuned $\pi_{0.5}$-DROID model as a low-level action policy.
Data generation and inference scripts are tested on Linux (with CUDA), macOS (with Metal), and Google Colab. Fine-tuning scripts are tested on Linux (with CUDA) and Google Colab.
A custom openpi package is included as a submodule and is required for running the project on macOS and Google Colab.
When cloning this repo, make sure to update submodules:
```bash
git clone --recurse-submodules git@github.com:Soohti/CubeVLA.git
# Or if you already cloned the repo:
git submodule update --init --recursive
```

Run the following to set up the environment:
```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
```

Create a `.env` file in the root directory with the following content:

```
OPENPI_DATA_HOME=./pi05_droid
CONFIG_NAME=pi05_droid
MODEL_DOWNLOAD_URL=gs://openpi-assets/checkpoints/pi05_droid
MODEL_CHECKPOINT_PATH=./pi05_droid/openpi-assets/checkpoints/pi05_droid
DATASET_COPY_WORKERS=10
REPO_ID=soohti/droid-ft
QWEN_ID=Qwen/Qwen2.5-VL-3B-Instruct
QWEN_PATH=./qwen2.5-
```
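The project's scripts read these values from `.env`; if you also want them in your interactive shell (for ad-hoc commands), one standard bash pattern is:

```bash
set -a        # auto-export every variable assigned while this is active
source .env   # read the KEY=VALUE pairs into the environment
set +a        # stop auto-exporting
echo "$CONFIG_NAME"   # sanity check: should print pi05_droid
```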
- `download.py`: Download the $\pi_{0.5}$-DROID and Qwen-2.5-VL-3B-Instruct base model checkpoints.
- `run_base.py`: Run inference using the $\pi_{0.5}$-DROID base model only. Does not support hierarchical planning.
- `data_generation.py`: Generate training data using the simulator.
- `run_predict.py`: Run CubeVLA inference using the (fine-tuned) Qwen-2.5-VL and $\pi_{0.5}$ models.
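A typical end-to-end run chains these scripts roughly as follows; this is a sketch, and any per-script flags (not shown here) should be checked with each script's `--help`:

```bash
uv run download.py         # fetch the pi0.5-DROID and Qwen-2.5-VL-3B-Instruct checkpoints
uv run data_generation.py  # roll out the simulator to produce training data
uv run run_predict.py      # hierarchical inference: Qwen plans, pi0.5 acts
```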
The following `TrainConfig` is used for fine-tuning $\pi_{0.5}$:
```python
TrainConfig(
    name="pi05_droid_ft",
    model=pi0_config.Pi0Config(pi05=True, action_horizon=16, discrete_state_input=False),
    data=LeRobotDROIDDataConfig(
        repo_id="soohti/droid-ft",
        base_config=DataConfig(prompt_from_task=True),
        # extra_delta_transform=True,
        assets=AssetsConfig(
            # Important: reuse the original DROID norm stats during fine-tuning!
            assets_dir="pi05_droid/assets",
            asset_id="droid",
        ),
    ),
    batch_size=64,
    lr_schedule=_optimizer.CosineDecaySchedule(
        # warmup_steps=10_000,
        warmup_steps=100,
        peak_lr=5e-5,
        # decay_steps=1_000_000,
        decay_steps=200,
        decay_lr=5e-5,
    ),
    optimizer=_optimizer.AdamW(clip_gradient_norm=1.0),
    weight_loader=weight_loaders.CheckpointWeightLoader("pi05_droid/params"),
    # num_train_steps=30_000,
    num_train_steps=5000,
    save_interval=1000,
    log_interval=50,
    fsdp_devices=4,
)
```
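With this config registered as `pi05_droid_ft`, training can be launched through openpi's standard entry point. A minimal sketch, assuming the bundled openpi submodule keeps the upstream `scripts/train.py` interface and using an arbitrary experiment name:

```bash
# Let JAX use up to 90% of GPU memory (openpi's recommended setting).
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_droid_ft \
  --exp-name=cubevla-pi05-ft --overwrite
```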
Fine-tuning the Qwen planner requires a different version of `transformers`: enter the `Qwen3-VL/qwen-vl-finetune` directory and set up its environment there. The `tools/convert_droid_ft.py` script converts the dataset into the format expected for fine-tuning. The following command is used for fine-tuning:

```bash
# Launch one process per visible GPU.
IFS=',' read -ra DEV_ARR <<<"${CUDA_VISIBLE_DEVICES}"
NPROC_PER_NODE=${#DEV_ARR[@]}
MASTER_PORT=${MASTER_PORT:-29502}
PIXELS=$((224*224))
torchrun --nproc_per_node=${NPROC_PER_NODE} --master_port=${MASTER_PORT} \
$ROOT/qwenvl/train/train_qwen.py \
--model_name_or_path "$MODEL_PATH" \
--dataset_use droid_ft%100 \
--output_dir "$OUTPUT_DIR" \
--cache_dir "$CACHE_DIR" \
--bf16 \
--model_max_length 4096 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-5 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--warmup_ratio 0.03 \
--weight_decay 0.01 \
--logging_steps 10 \
--save_steps 1000 \
--save_total_limit 3 \
--data_flatten True \
--data_packing False \
--max_pixels ${PIXELS} \
--min_pixels ${PIXELS} \
--tune_mm_llm True \
--tune_mm_mlp False \
--tune_mm_vision False \
--lora_enable False \
--optim adamw_torch \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--report_to none
```
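For reference, the effective optimizer batch size is `NPROC_PER_NODE` × `per_device_train_batch_size` × `gradient_accumulation_steps`; with 4 GPUs that is 4 × 4 × 4 = 64 sequences per step, matching the `batch_size=64` used for the $\pi_{0.5}$ fine-tune. Setting `--max_pixels` and `--min_pixels` to the same value (224 × 224 = 50176) forces every image to a fixed pixel budget.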