COME-robot

Abstract

Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~35%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.

A brief overview of COME-robot's workflow. Given a task instruction, COME-robot employs GPT-4V for reasoning and generates a code-based plan. Using feedback obtained from the robot's execution and its interaction with the environment, it iteratively updates the subsequent plan or recovers from failures, ultimately accomplishing the given task.

COME-robot's planner has two key designs: Open-Vocabulary Perception and Reasoning, and Closed-Loop Feedback and Restoration. The former helps the robot ground open-ended instructions in real environments, and the latter helps ensure that the task is completed. Actions to be executed, as reasoned by GPT-4V, are highlighted in blue; identified failures are highlighted in red; and analyses made after observation or verification are highlighted in green.
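
The closed-loop workflow described above can be illustrated with a minimal Python sketch. It is not COME-robot's implementation: the names `run_closed_loop`, `Feedback`, `LoopResult`, `make_stub_planner`, and `make_flaky_executor` are hypothetical, and the planner is a toy stand-in that replays a fixed list of steps rather than querying GPT-4V. The sketch only shows the control flow: plan the next action, execute it, record the feedback, and let the planner see that feedback before it chooses again.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Feedback:
    """Outcome of executing one action (hypothetical structure)."""
    success: bool
    message: str


@dataclass
class LoopResult:
    """Whether the task finished, plus a log of every attempted action."""
    completed: bool
    history: List[str] = field(default_factory=list)


# A planner maps (instruction, history) to the next action, or None when done.
Planner = Callable[[str, List[str]], Optional[str]]
# An executor runs one action on the robot and reports feedback.
Executor = Callable[[str], Feedback]


def run_closed_loop(instruction: str, planner: Planner, executor: Executor,
                    max_steps: int = 20) -> LoopResult:
    """Alternate planning and execution, feeding results back to the planner."""
    history: List[str] = []
    for _ in range(max_steps):
        action = planner(instruction, history)
        if action is None:  # planner judges the task complete
            return LoopResult(True, history)
        feedback = executor(action)
        status = "ok" if feedback.success else "failed"
        history.append(f"{action} -> {status}: {feedback.message}")
    return LoopResult(False, history)  # step budget exhausted


def make_stub_planner(steps: List[str]) -> Planner:
    """Toy stand-in for the vision-language model (assumption, not GPT-4V).

    It replays a fixed plan and repeats a step until that step succeeds.
    """
    def planner(instruction: str, history: List[str]) -> Optional[str]:
        succeeded = sum(1 for entry in history if "-> ok" in entry)
        return steps[succeeded] if succeeded < len(steps) else None
    return planner


def make_flaky_executor(fail_once: Set[str]) -> Executor:
    """Simulated robot: each action in `fail_once` fails on its first attempt."""
    already_failed: Set[str] = set()

    def executor(action: str) -> Feedback:
        if action in fail_once and action not in already_failed:
            already_failed.add(action)
            return Feedback(False, "gripper closed on nothing")
        return Feedback(True, "done")
    return executor


steps = ["find cup", "grasp cup", "place cup on tray"]
result = run_closed_loop("put the cup on the tray",
                         make_stub_planner(steps),
                         make_flaky_executor({"grasp cup"}))
for entry in result.history:
    print(entry)
```

In this toy run the grasp fails once, the failure is recorded in the history, and the planner retries the same step before moving on. A real system would replace the stub planner with a model query that receives images and execution feedback, and the retry rule with the model's own analysis of the failure cause.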

Results

legged manipulation

mobile manipulation

tabletop manipulation

BibTeX

@misc{zhi2025closedloopopenvocabularymobilemanipulation,
  title={Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V},
  author={Peiyuan Zhi and Zhiyuan Zhang and Yu Zhao and Muzhi Han and Zeyu Zhang and Zhitian Li and Ziyuan Jiao and Baoxiong Jia and Siyuan Huang},
  year={2025},
  eprint={2404.10220},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2404.10220},
}

Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V
