Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

1State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI), 2Department of Automation, Tsinghua University, 3University of California, Los Angeles
*Indicates Equal Contribution

Abstract

Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. We present COME-robot, the first closed-loop framework that leverages the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. We carefully construct a library of action primitives for robot exploration, navigation, and manipulation, which serve as callable execution modules for GPT-4V during task planning. On top of these modules, GPT-4V acts as the brain: it performs multimodal reasoning, generates action policies as code, verifies task progress, and provides feedback for replanning. This design enables COME-robot to (i) actively perceive the environment, (ii) perform situated reasoning, and (iii) recover from failures. In comprehensive experiments on 8 challenging real-world tabletop and mobile manipulation tasks, COME-robot improves the task success rate by ~25% over state-of-the-art baseline methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.

Approach


A brief overview of COME-robot's workflow. Given a task instruction, COME-robot employs GPT-4V to reason and generate a code-based plan. Using feedback obtained from the robot's execution and interaction with the environment, it iteratively updates the subsequent plan or recovers from failures, ultimately accomplishing the given task.
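The closed loop described above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the primitive names (`navigate_to`, `grasp`), the `Feedback` structure, and the `plan` stub standing in for a GPT-4V query are all hypothetical assumptions.

```python
# Hedged sketch of a COME-robot-style closed loop: a primitive library
# exposed as callable modules, a planner that proposes the next call,
# and an execution loop that feeds results back for replanning.
# All names here are illustrative, not the paper's actual API.
from dataclasses import dataclass


@dataclass
class Feedback:
    success: bool
    message: str


class PrimitiveLibrary:
    """Callable execution modules the planner can invoke."""

    def navigate_to(self, target):
        return Feedback(True, f"arrived at {target}")

    def grasp(self, obj, attempt=1):
        # Simulate a transient failure on the first grasp attempt.
        ok = attempt > 1
        return Feedback(ok, "grasp succeeded" if ok else "gripper empty")


def plan(instruction, history):
    """Stand-in for GPT-4V: a real system would prompt the VLM with
    images and the execution history, and receive the next call as code."""
    if any("gripper empty" in msg for msg in history):
        return "grasp", {"obj": "cup", "attempt": 2}  # recover: retry grasp
    if not history:
        return "navigate_to", {"target": "table"}
    return "grasp", {"obj": "cup", "attempt": 1}


def closed_loop(instruction, lib, max_steps=5):
    """Execute, observe feedback, and replan until success or step limit."""
    history = []
    for _ in range(max_steps):
        name, kwargs = plan(instruction, history)
        fb = getattr(lib, name)(**kwargs)
        history.append(fb.message)
        if name == "grasp" and fb.success:
            return True, history
    return False, history
```

In this toy run the first grasp fails, the failure message enters the history, and the planner issues a recovery call on the next iteration, mirroring the verify-and-replan cycle in the figure.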





The unique properties of COME-robot: active perception, situated commonsense reasoning, and failure recovery. Actions reasoned by GPT-4V for execution are highlighted in blue, identified failures in red, and outcomes verified after observation or recovery in green.



Results

Mobile manipulation



Tabletop manipulation

Cases of failure recovery

BibTeX

@misc{zhi2024closedloop,
      title={Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V},
      author={Peiyuan Zhi and Zhiyuan Zhang and Muzhi Han and Zeyu Zhang and Zhitian Li and Ziyuan Jiao and Baoxiong Jia and Siyuan Huang},
      year={2024},
      eprint={2404.10220},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}