ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

PKU, BIGAI, UCLA
Team CraftJarvis

Our ROCKET-1 solves diverse creative tasks in Minecraft via visual-temporal context prompting.

Videos of ROCKET-1 playing Minecraft.

Playing with ROCKET-1 on Gradio.

Abstract

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. A key issue is the difficulty of smoothly connecting individual entities in low-level observations with the abstract concepts required for planning. A common approach is to use hierarchical agents, where VLMs serve as high-level reasoners that break tasks down into executable sub-tasks, typically specified in language or as imagined observations. However, language often fails to effectively convey spatial information, while generating future images with sufficient accuracy remains challenging. To address these limitations, we propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from both past and present observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, with real-time object tracking provided by SAM-2. Our method unlocks the full potential of VLMs’ visual-language reasoning abilities, enabling them to solve complex creative tasks, especially those that rely heavily on spatial understanding. Experiments in Minecraft demonstrate that our approach allows agents to accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making.

Method

Policy Architecture. ROCKET-1 processes interaction types (\(c\)), observations (\(o\)), and object segmentations (\(m\)) to predict actions (\(a\)) using a causal transformer. Observations and segmentations are concatenated and passed through a visual backbone for deep fusion. Interaction types and segmentations are randomly dropped with a set probability during training.
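For concreteness, the description above can be rendered as a short PyTorch sketch. This is an illustrative reconstruction under stated assumptions, not the released implementation: the backbone layout, model sizes, and the flat n_actions head (standing in for the real Minecraft action space) are all assumptions.

    import torch
    import torch.nn as nn

    class Rocket1Policy(nn.Module):
        """Sketch of the ROCKET-1 policy: fuse each observation with its
        object-segmentation mask, condition on an interaction type, and
        predict actions with a causal transformer over the time axis."""

        def __init__(self, n_interactions=6, d_model=512, n_actions=128, n_layers=4):
            super().__init__()
            # The visual backbone consumes 4 channels: RGB observation (3)
            # plus the binary segmentation mask (1), i.e. deep fusion.
            self.backbone = nn.Sequential(
                nn.Conv2d(4, 64, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(128, d_model, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            # Index 0 is reserved for "condition dropped / none".
            self.interaction_emb = nn.Embedding(n_interactions + 1, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.action_head = nn.Linear(d_model, n_actions)

        def forward(self, obs, seg, interaction, drop_p=0.0):
            # obs: (B, T, 3, H, W) float; seg: (B, T, 1, H, W) float;
            # interaction: (B, T) long.
            B, T = obs.shape[:2]
            if self.training and drop_p > 0:
                # Randomly drop the conditioning signals during training,
                # as described in the caption above.
                drop = torch.rand(B, T, device=obs.device) < drop_p
                seg = seg * (~drop).view(B, T, 1, 1, 1)
                interaction = interaction.masked_fill(drop, 0)
            x = torch.cat([obs, seg], dim=2).flatten(0, 1)   # (B*T, 4, H, W)
            feats = self.backbone(x).view(B, T, -1)          # (B, T, d_model)
            feats = feats + self.interaction_emb(interaction)
            causal = torch.triu(                             # causal attention mask
                torch.full((T, T), float("-inf"), device=obs.device), diagonal=1)
            h = self.temporal(feats, mask=causal)
            return self.action_head(h)                       # per-step action logits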

Trajectory relabeling pipeline in Minecraft and video examples. A bounding box and a point prompt are applied at the image center of the frame preceding the interaction event to identify the interacted object. SAM-2 is then run in reverse temporal order for a fixed duration, with the interaction type held constant throughout.
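A minimal script of this relabeling logic, written against the SAM-2 video predictor from the facebookresearch/sam2 repository, might look as follows. The config and checkpoint paths, box size, tracking window, and event format are assumptions; only the overall procedure (prompt at the image center of the pre-interaction frame, propagate in reverse temporal order, hold the interaction type constant) follows the pipeline described above.

    import numpy as np
    from sam2.build_sam import build_sam2_video_predictor

    def relabel_trajectory(video_dir, num_frames, events, frame_hw,
                           window=128, half_box=20):
        """Relabel one gameplay trajectory with segmentation conditions.

        events: list of (frame_idx, interaction_type) pairs mined from
        the recording. Returns per-frame masks and interaction labels.
        """
        H, W = frame_hw
        predictor = build_sam2_video_predictor(
            "configs/sam2.1/sam2.1_hiera_l.yaml",    # assumed paths
            "checkpoints/sam2.1_hiera_large.pt")
        masks = [None] * num_frames
        interactions = [0] * num_frames              # 0 = unconditioned

        for t_event, itype in events:
            t0 = max(t_event - 1, 0)                 # frame preceding the event
            state = predictor.init_state(video_path=video_dir)
            # The interacted object sits under the crosshair, so a small
            # box at the image center identifies it (the pipeline also
            # adds a center point click; box-only keeps this sketch short).
            cx, cy = W / 2, H / 2
            predictor.add_new_points_or_box(
                state, frame_idx=t0, obj_id=1,
                box=np.array([cx - half_box, cy - half_box,
                              cx + half_box, cy + half_box], dtype=np.float32))
            # Track the object in *reverse* temporal order for `window` frames.
            for t, obj_ids, logits in predictor.propagate_in_video(
                    state, start_frame_idx=t0,
                    max_frame_num_to_track=window, reverse=True):
                masks[t] = (logits[0] > 0.0).cpu().numpy()
                interactions[t] = itype              # interaction type held constant
        return masks, interactions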

Integration with high-level reasoner. A GPT-4o model decomposes complex tasks into steps based on the current observation, while the Molmo model identifies interactive objects by outputting points. SAM-2 segments these objects based on the point prompts, and ROCKET-1 uses the object masks and interaction types to make decisions. GPT-4o and Molmo run at low frequencies, while SAM-2 and ROCKET-1 operate at the same frequency as the environment.
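The division of labor and the two control frequencies can be summarized in a schematic loop. All four callables below are hypothetical stand-ins for the actual model interfaces, and the replanning intervals are illustrative.

    def hierarchical_agent_loop(env, gpt4o_plan, molmo_point, sam2_segment,
                                rocket1, max_steps=12000,
                                replan_every=600, repoint_every=150):
        """Schematic hierarchy: slow reasoning and grounding, fast control.

        gpt4o_plan(obs)        -> (step_text, interaction_type)   # slow
        molmo_point(obs, step) -> (x, y) point on the target object  # slow
        sam2_segment(obs, pt)  -> binary object mask   # every env step
        rocket1(obs, mask, it) -> low-level action     # every env step
        All four callables are hypothetical stand-ins for the real models.
        """
        obs = env.reset()
        step_text, interaction, point = None, 0, None
        for t in range(max_steps):
            if t % replan_every == 0:
                # GPT-4o (low frequency): decompose the task into the
                # next step given the current observation.
                step_text, interaction = gpt4o_plan(obs)
            if t % repoint_every == 0:
                # Molmo (low frequency): ground the step to a pixel
                # coordinate on the object to interact with.
                point = molmo_point(obs, step_text)
            # SAM-2 and ROCKET-1 run at the environment's frequency.
            mask = sam2_segment(obs, point)
            action = rocket1(obs, mask, interaction)
            obs, reward, done, info = env.step(action)
            if done:
                break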

Embodied Decision-Making Pipeline Comparisons

Different pipelines in solving embodied decision-making tasks. (a) End-to-end pipeline modeling token sequences of language, observations, and actions. (b) Language prompting: VLMs decompose instructions for language-conditioned policy execution. (c) Latent prompting: maps discrete behavior tokens to low-level actions. (d) Future-image prompting: fine-tunes VLMs and diffusion models for image-conditioned control. (e) Visual-temporal prompting: VLMs generate segmentations and interaction cues to guide ROCKET-1.

Minecraft Interaction Benchmark

The Minecraft Interaction Benchmark contains six interaction types, totaling 12 tasks. Unlike previous benchmarks, these tasks emphasize interacting with objects at specific spatial locations: for example, in "hunt the sheep in the right fence," the task fails if the agent kills the sheep on the left side. Some tasks, such as "place the oak door on the diamond block," never appear in the training set, so the benchmark also evaluates zero-shot generalization.

Experiment Results

Results on the Minecraft Interaction benchmark. Each task is tested 32 times, and the average success rate (in \(\%\)) is reported. "Human" indicates instructions provided by a human.
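A sketch of this evaluation protocol, where run_episode is a hypothetical rollout function returning task success:

    def success_rate(task, run_episode, n_trials=32):
        """Average success rate in % over n_trials rollouts, matching the
        protocol above. run_episode(task) -> bool (hypothetical)."""
        wins = sum(bool(run_episode(task)) for _ in range(n_trials))
        return 100.0 * wins / n_trials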

BibTeX


        @article{cai2024rocket1,
          title   = {ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting},
          author  = {Shaofei Cai and Zihao Wang and Kewei Lian and Zhancun Mu and Xiaojian Ma and Anji Liu and Yitao Liang},
          year    = {2024},
          journal = {arXiv preprint arXiv:2410.17856}
        }