Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

1State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI), 2Department of Automation, Tsinghua University, 3University of California, Los Angeles
*Indicates Equal Contribution

ICRA 2025

Abstract

Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~35%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.

Approach


A brief overview of COME-robot's workflow. Given a task instruction, COME-robot employs GPT-4V for reasoning and generates a code-based plan. Through feedback obtained from the robot's execution and interaction with the environment, it iteratively updates the subsequent plan or recovers from failures, ultimately accomplishing the given task.
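The plan, execute, and replan cycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: `ClosedLoopRunner`, `Feedback`, `toy_planner`, and `make_flaky_executor` are hypothetical names, the planner is a plain function standing in for the GPT-4V call, and the executor is a stub standing in for the robot's action API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feedback:
    """Result of executing one plan step (hypothetical structure)."""
    success: bool
    message: str


@dataclass
class ClosedLoopRunner:
    # Stand-in for the vision-language model: maps (instruction, history) to plan steps.
    planner: Callable[[str, List[Feedback]], List[str]]
    # Stand-in for the robot: executes one step and reports what happened.
    executor: Callable[[str], Feedback]
    max_rounds: int = 5
    history: List[Feedback] = field(default_factory=list)

    def run(self, instruction: str) -> bool:
        for _ in range(self.max_rounds):
            plan = self.planner(instruction, self.history)
            if not plan:
                return True  # planner sees nothing left to do
            for step in plan:
                fb = self.executor(step)
                self.history.append(fb)
                if not fb.success:
                    break  # abandon the rest of this plan and replan with the feedback
        return False  # replanning budget exhausted


# Toy task: three fixed steps; the planner resumes after the last successful one.
STEPS = ["navigate_to(table)", "grasp(cup)", "place(cup, sink)"]


def toy_planner(instruction: str, history: List[Feedback]) -> List[str]:
    done = sum(1 for fb in history if fb.success)
    return STEPS[done:]


def make_flaky_executor() -> Callable[[str], Feedback]:
    attempts = {}

    def execute(step: str) -> Feedback:
        attempts[step] = attempts.get(step, 0) + 1
        # Simulate a grasp that fails on the first attempt only.
        if step.startswith("grasp") and attempts[step] == 1:
            return Feedback(False, f"{step}: grasp failed")
        return Feedback(True, f"{step}: ok")

    return execute


if __name__ == "__main__":
    runner = ClosedLoopRunner(toy_planner, make_flaky_executor())
    print(runner.run("put the cup in the sink"))
    for fb in runner.history:
        print(fb.message)
```

In this toy run the first grasp fails, the remaining plan is discarded, and the next planning round retries from the failed step using the accumulated feedback. In a real system the feedback would presumably carry observations such as images rather than a short message.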





The unique properties of COME-robot: Active Perception, Situated Commonsense Reasoning, and Recover from Failure. Actions to be executed as reasoned by GPT-4V are highlighted in blue, identified failures are highlighted in red, and outcomes verified after observation or recovery are highlighted in green.



Results

Mobile manipulation



Tabletop manipulation

Cases of recovery from failures

System Prompts


BibTeX

@misc{zhi2024closedloop,
  title={Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V},
  author={Peiyuan Zhi and Zhiyuan Zhang and Muzhi Han and Zeyu Zhang and Zhitian Li and Ziyuan Jiao and Baoxiong Jia and Siyuan Huang},
  year={2024},
  eprint={2404.10220},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}