Songze Li, Zun Wang, Gengze Zhou, Jialu Li, Xiangyu Zeng, Limin Wang, Yu Qiao, Qi Wu, Mohit Bansal, Yi Wang
Goal-oriented language-guided navigation requires robust exploration capabilities for agents to navigate to specified goals in unknown environments without step-by-step instructions. Existing methods tend to rely exclusively on shortest-path trajectories, which lack the effective exploration priors needed to maximize success rates. To address these challenges, we present SID, a goal-oriented language-guided navigation learning approach with Self-Improving Demonstrations. Specifically, SID trains an initial agent on shortest-path data sampled from environments and then leverages this agent to generate novel exploration trajectories. These rollouts provide demonstrations with stronger exploration signals for training a better agent, which in turn produces higher-quality demonstrations for the next round of training. We show that this iterative self-improving pipeline readily scales to new environments, and the resulting demonstrations transfer across a variety of language-guided navigation tasks, raising the performance ceiling on diverse goal-oriented navigation benchmarks. Extensive experiments demonstrate that SID significantly boosts the exploration capabilities and generalization of navigation agents. The resulting agent achieves new state-of-the-art performance on goal-oriented language-guided navigation tasks, including REVERIE and SOON, notably achieving a 50.9% success rate on the unseen validation split of SOON, surpassing the prior leading approach by 13.9%.
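For intuition, below is a minimal sketch of this self-improving loop. Every name in it is an illustrative placeholder passed in as a parameter, not this repo's actual API.

```python
from typing import Callable, List, Tuple

def self_improving_demonstrations(
    train_agent: Callable,           # trains an agent on a demonstration set
    rollout_trajectories: Callable,  # rolls an agent out in the environments
    is_successful: Callable,         # did this rollout reach its goal?
    shortest_path_data: List,
    envs: List,
    num_rounds: int = 3,
) -> Tuple[object, List]:
    # Round 0: bootstrap an agent from shortest-path supervision only.
    agent = train_agent(shortest_path_data)
    demos = shortest_path_data
    for _ in range(num_rounds):
        # The current agent explores and produces novel rollouts; successful
        # ones carry exploration behavior that shortest paths lack.
        rollouts = rollout_trajectories(agent, envs)
        demos = [traj for traj in rollouts if is_successful(traj)]
        # Retrain on the stronger demonstrations for the next round.
        agent = train_agent(demos)
    return agent, demos
```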
[2025-09-30] We release the paper for SID-VLN.
[2025-09-22] We release the code and data for SID-VLN.
We test under the following environment:
- Python 3.8.10
- PyTorch 2.0.0
- CUDA Version 11.7
- Install the Matterport3D simulator: follow the detailed instructions here. We use the latest version instead of v0.1. Here are simplified instructions:
```shell
git clone git@github.com:peteanderson80/Matterport3DSimulator.git
cd Matterport3DSimulator
git submodule update --init --recursive
sudo apt-get install libjsoncpp-dev libepoxy-dev libglm-dev libosmesa6 libosmesa6-dev libglew-dev libopencv-dev
mkdir build && cd build
cmake -DEGL_RENDERING=ON ..
make -j8
```
After successful installation, run:
```shell
cp your_path/Matterport3DSimulator/build/MatterSim.cpython-38-x86_64-linux-gnu.so your_conda_path/envs/sidvln/lib/python3.8/MatterSim.cpython-38-x86_64-linux-gnu.so
export PYTHONPATH=your_path/SIDVLN/mapnav:$PYTHONPATH
export PYTHONPATH=your_path/Matterport3DSimulator/build:$PYTHONPATH
```
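To verify the simulator is usable from the sidvln environment, a quick rendering-free check along these lines should suffice. The nav-graph path and the scan/viewpoint IDs below are placeholders to adapt to your setup.

```python
import math
import MatterSim  # resolved via the copied .so / PYTHONPATH exports above

sim = MatterSim.Simulator()
sim.setRenderingEnabled(False)        # skip EGL rendering for a pure API check
sim.setNavGraphPath("datasets/REVERIE/connectivity")  # assumed location
sim.setCameraResolution(640, 480)
sim.setCameraVFOV(math.radians(60))
sim.setDiscretizedViewingAngles(True)
sim.initialize()
# Placeholders: substitute a real scan/viewpoint ID from your connectivity files.
sim.newEpisode(["SCAN_ID"], ["VIEWPOINT_ID"], [0.0], [0.0])
state = sim.getState()[0]
print(state.scanId, state.location.viewpointId)
```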
- Install requirements:
```shell
conda create --name sidvln python=3.8.10
conda activate sidvln
cd SID-VLN
pip install -r requirements.txt
```
We release our final pretrained models and data here. Details:
Connectivity:
- Connectivity of the navigation graphs.
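As a quick look at the format, one scan's navigation graph can be loaded as below. The per-node fields follow the standard Matterport3D connectivity schema; verify them against the downloaded files, and replace the placeholder scan ID.

```python
import json

scan = "SCAN_ID"  # placeholder: use a real scan name from scans.txt
with open(f"datasets/REVERIE/connectivity/{scan}_connectivity.json") as f:
    nodes = json.load(f)  # one dict per viewpoint in this scan
# "image_id" and "unobstructed" are standard Matterport3D connectivity fields.
print(len(nodes), "viewpoints; first:", nodes[0]["image_id"])
```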
Data:
- `scan_round0_860scan.jsonl` – Image goal navigation trajectories in 800 HM3D environments.
- `sid_lang_goal.jsonl` – Final detailed-caption goal navigation trajectories for pretraining and REVERIE augmentation.
- `img_goal_val*.json` – Image goal navigation validation seen and unseen splits.
- `cap_goal_val*.json` – Caption goal navigation validation seen and unseen splits.
- `scanvp_candview_relangles_with_hm3d_gibson.json` – Candidate views and relative angles per scan and viewpoint in HM3D environments.
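The exact record schema of these .jsonl files is easiest to learn from the files themselves; a minimal inspection sketch (the path assumes the directory layout shown further below):

```python
import json

# Each line of a .jsonl file is one trajectory record. Print the keys of the
# first record to discover the actual schema before writing a loader.
with open("datasets/REVERIE/annotations/sid_lang_goal.jsonl") as f:
    record = json.loads(next(f))
print(sorted(record.keys()))
```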
Features:
- `siglip_base.hdf5` – SigLIP features on MP3D and HM3D environments.
- `dinov2_base.hdf5` – DINOv2 features on MP3D and HM3D environments.
- `obj.avg.top3.min80_vit_base_patch16_224_imagenet.hdf5` – Object features for REVERIE.
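A minimal sketch for inspecting these HDF5 feature files. The key layout (commonly one feature matrix per scan-viewpoint key in VLN codebases) is an assumption to confirm against the released files:

```python
import h5py

with h5py.File("datasets/REVERIE/features/siglip_base.hdf5", "r") as f:
    first_key = next(iter(f.keys()))
    # Typically one (num_views, feature_dim) array per key; verify the shape.
    print(first_key, f[first_key].shape)
```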
HM3D_cap:
- Generated detailed-style captions for target images in HM3D and MP3D environments.
Model:
- `model_step_124000.pt` – The final pretrained model for downstream VLN finetuning.
- `img_goal_best_val_unseen` – The image goal navigation agent used to generate trajectories that serve as high-quality demonstrations of exploration strategies.
- `model_LXRT.pth` – The pretrained LXMERT model for initializing DUET.
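To inspect a released checkpoint before finetuning, something like the following works; whether the file is a bare state dict or a wrapped dict is an assumption to check:

```python
import torch

# Loading on CPU avoids needing a GPU just to look at the checkpoint.
ckpt = torch.load("datasets/ckpts/model_step_124000.pt", map_location="cpu")
keys = list(ckpt.keys()) if isinstance(ckpt, dict) else []
print(keys[:10])  # peek at the top-level layout
```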
The data folder should follow this structure:
```shell
datasets/
├── ckpts/
│   ├── model_LXRT.pth
│   ├── img_goal_best_val_unseen
│   └── model_step_124000.pt
├── REVERIE/
│   ├── annotations/
│   │   ├── scan_round0_860scan.jsonl
│   │   ├── sid_lang_goal.jsonl
│   │   ├── img_goal_val*.json
│   │   ├── cap_goal_val*.json
│   │   └── scanvp_candview_relangles_with_hm3d_gibson.json
│   ├── connectivity/
│   │   ├── scanname_connectivity.json
│   │   └── scans.txt
│   └── features/
│       ├── siglip_base.hdf5
│       ├── dinov2_base.hdf5
│       └── obj.avg.top3.min80_vit_base_patch16_224_imagenet.hdf5
└── SOON/
```
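A small sanity check that the expected files are in place; the paths mirror the tree above, so adjust the root if your datasets folder lives elsewhere:

```python
import os

expected = [
    "datasets/ckpts/model_LXRT.pth",
    "datasets/ckpts/model_step_124000.pt",
    "datasets/REVERIE/annotations/sid_lang_goal.jsonl",
    "datasets/REVERIE/features/siglip_base.hdf5",
]
for path in expected:
    # Flag anything missing before launching a multi-GPU run.
    print(("OK  " if os.path.exists(path) else "MISS") + " " + path)
```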
- Multi-Round SID Pre-training

We use 8 NVIDIA A800 GPUs for pre-training agents on image goal navigation.
```shell
cd pretrain
bash run_img_goal.sh
```
- SID Fine-tuning & Trajectory Generation

We use 8 NVIDIA A800 GPUs for fine-tuning agents and generating trajectories for the next round of training.
```shell
cd mapnav
bash scripts/run_img_goal.sh
```
- Language Goal Pre-training

We use 8 NVIDIA A800 GPUs for pre-training language goal navigation agents.
```shell
bash run_lang_goal.sh
```
- Downstream VLN Task Fine-tuning

We use one NVIDIA A800 GPU for finetuning our agent on downstream VLN tasks. The concrete configurations are provided in the scripts.
```shell
bash run_lang_goal.sh
```
Please feel free to open an issue if you encounter any problems or have questions about SID-VLN.
If you find our work useful in your research, please consider starring 🌟 this repo and citing the following paper:
@article{li2025learning,
title={Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale},
author={Li, Songze and Wang, Zun and Zhou, Gengze and Li, Jialu and Zeng, Xiangyu and Wang, Limin and Qiao, Yu and Wu, Qi and Bansal, Mohit and Wang, Yi},
journal={arXiv preprint arXiv:2509.24910},
year={2025}
}
We thank the developers of DUET, SRDF, and InternVL for their public code releases.