Abstract
Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. When integrated into an embodied agent, existing embodied VLM works either output detailed action sequences at the manipulation level or only provide plans at an abstract level, leaving a gap between high-level planning and real-world manipulation. To bridge this gap, we introduce Octopus, an embodied vision-language programmer that uses executable code generation as a medium to connect planning and manipulation. Octopus is designed to 1) proficiently comprehend an agent’s visual and textual task objectives, 2) formulate intricate action sequences, and 3) generate executable code. To facilitate Octopus model development, we introduce OctoVerse: a suite of environments tailored for benchmarking vision-based code generators on a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games such as Grand Theft Auto (GTA) and Minecraft. To train Octopus, we leverage GPT-4 to control an explorative agent that generates training data, i.e., action blueprints and corresponding executable code. We also collect feedback that enables an enhanced training scheme called Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments, we demonstrate Octopus’s functionality and present compelling results, showing that the proposed RLEF refines the agent’s decision-making. By open-sourcing our simulation environments, dataset, and model architecture, we aspire to ignite further innovation and foster collaborative applications within the broader embodied AI community. The project page is available at https://choiszt.github.io/Octopus/.
Original language | English |
---|---|
Title of host publication | Computer Vision – ECCV 2024 |
Subtitle of host publication | 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part I |
Editors | Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol |
Publisher | Springer Cham |
Pages | 20-38 |
Number of pages | 19 |
Edition | 1st |
ISBN (Electronic) | 9783031732324 |
ISBN (Print) | 9783031732317 |
Publication status | Published - 29 Sept 2024 |
Event | 18th European Conference on Computer Vision, ECCV 2024 - Milan, Italy; Duration: 29 Sept 2024 → 4 Oct 2024; Conference website: https://eccv.ecva.net/Conferences/2024; Conference proceedings: https://link.springer.com/book/10.1007/978-3-031-73232-4 |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Volume | 15059 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 18th European Conference on Computer Vision, ECCV 2024 |
---|---|
Country/Territory | Italy |
City | Milan |
Period | 29/09/24 → 04/10/24 |
Scopus Subject Areas
- Theoretical Computer Science
- General Computer Science