The replication code for our experiments has been open-sourced, and the full system will be released once the paper is accepted.
Recent advancements in 3D reconstruction and neural rendering have enhanced the creation of high-quality digital assets, yet existing methods struggle to generalize across varying object shapes, textures, and occlusions. While Next Best View (NBV) planning and learning-based approaches offer partial solutions, they are often limited by predefined criteria and fail to handle occlusions effectively. We present AIR-Embodied, a novel framework that integrates embodied AI agents with large-scale pretrained multi-modal language models (MLLMs) to improve active reconstruction. AIR-Embodied employs a three-stage process: understanding the current reconstruction state via multi-modal prompts, planning tasks with viewpoint selection and interactive actions, and applying closed-loop reasoning to ensure accurate execution. The agent dynamically refines its actions based on discrepancies between planned and actual outcomes. Experimental evaluations across virtual and real-world environments demonstrate that AIR-Embodied significantly improves reconstruction efficiency and quality, providing a robust solution to challenges in active 3D reconstruction.
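For illustration only, the sketch below outlines the three-stage loop described above (understanding, planning, closed-loop execution). All names (`describe_state`, `execute`, `measure_discrepancy`, `mllm.query`) are hypothetical placeholders assumed for exposition; this is not the released AIR-Embodied implementation.

```python
# Illustrative sketch of the three-stage closed loop (hypothetical interfaces,
# not the released AIR-Embodied code).

def active_reconstruction_loop(mllm, scene, max_steps=20, tol=0.05):
    """Understand -> plan -> execute, with closed-loop verification."""
    for step in range(max_steps):
        # Stage 1: understanding -- summarize the current reconstruction
        # state as a multi-modal prompt (e.g., rendered views, coverage stats).
        state_prompt = scene.describe_state()          # hypothetical helper
        understanding = mllm.query(state_prompt)

        # Stage 2: planning -- ask the MLLM for the next viewpoint and any
        # interactive action (e.g., moving an occluding object).
        plan = mllm.query(f"Plan the next view/action given: {understanding}")

        # Stage 3: closed-loop execution -- act, then compare the planned
        # outcome with the observed one and refine the action if they differ.
        observed = scene.execute(plan)                  # hypothetical helper
        discrepancy = scene.measure_discrepancy(plan, observed)
        if discrepancy > tol:
            revised = mllm.query(
                f"Plan and outcome disagree ({discrepancy:.3f}); revise: {plan}"
            )
            scene.execute(revised)

        if scene.reconstruction_complete():             # hypothetical criterion
            break
```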