ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

Shaofei Cai¹, Zihao Wang¹, Kewei Lian¹, Zhancun Mu¹, Xiaojian Ma², Anji Liu³, Yitao Liang^1*

¹PKU, ²BIGAI, ³UCLA

Team CraftJarvis

Paper Code YouTube arXiv Demo

MY ALT TEXT

Our ROCKET-1 can solve diverse creative tasks in Minecraft by prompting in visual-temporal context.

Playing with ROCKET-1 on Gradio.

Videos of ROCKET-1 playing Minecraft.

Abstract

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a 76% absolute improvement in open-world interaction performance.

Method

MY ALT TEXT

Policy Architecture. ROCKET-1 processes interaction types (\(c\)), observations (\(o\)), and object segmentations (\(m\)) to predict actions (\(a\)) using a causal transformer. Observations and segmentations are concatenated and passed through a visual backbone for deep fusion. Interaction types and segmentations are randomly dropped with a set probability during training.

MY ALT TEXT

Trajectory relabeling pipeline in Minecraft and video examples. A bounding box and point selection are applied to the image center in the frame preceding the interaction event to identify the interaction object. SAM-2 is then run in reverse temporal order for a specified duration, with the interaction type remaining consistent throughout.

Integration with high-level reasoner. A GPT-4o model decomposes complex tasks into steps based on the current observation, while the Molmo model identifies interactive objects by outputting points. SAM-2 segments these objects based on the point prompts, and ROCKET-1 uses the object masks and interaction types to make decisions. GPT-4o and Molmo run at low frequencies, while SAM-2 and ROCKET-1 operate at the same frequency as the environment.

Embodied Decision-Making Pipeline Comparisons

MY ALT TEXT

Different pipelines in solving embodied decision-making tasks. (a) End-to-end pipeline modeling token sequences of language, observations, and actions. (b) Language prompting: VLMs decompose instructions for language-conditioned policy execution. (c) Latent prompting: maps discrete behavior tokens to low-level actions. (d) Future-image prompting: fine-tunes VLMs and diffusion models for image-conditioned control. (e) Visual-temporal prompting: VLMs generate segmentations and interaction cues to guide ROCKET-1.

Minecraft Interaction Benchmark

The Minecraft Interaction Benchmark contains six interaction types in Minecraft, totaling 12 tasks. Unlike previous benchmarks, these tasks emphasize interacting with objects at specific spatial locations. For example, in "hunt the sheep in the right fence," the task fails if the agent kills the sheep on the left side. Some tasks, such as "place the oak door on the diamond block," never appear in the training set. It is also designed to evaluate zero-shot generalization capabilities.

MY ALT TEXT

Experiment Results

Results on the Minecraft Interaction benchmark. Each task is tested 32 times, and the average success rate (in \(\%\)) is reported as the final result. "Human" indicates instructions provided by a human.

MY ALT TEXT

Hunt the sheep in the right fence.

Hunt the cow while do not touch the sheep.

Mine the emerald.

Mine the coal.

Interact with the left chest.

Open the door then open the chest in the house.

Approach the nearest village.

Approach the ocean.

Set fire on a tree.

Use bucket to get lava.

Place minecart on the rail.

Place the oak door on the diamond block.

BibTeX


        @article{cai2024rocket1,
          title   = {ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting},
          author  = {Shaofei Cai and Zihao Wang and Kewei Lian and Zhancun Mu and Xiaojian Ma and Anji Liu and Yitao Liang},
          year    = {2024},
          journal = {arXiv preprint arXiv: 2410.17856}
        }