We aim to develop a goal specification method that is semantically clear, spatially sensitive, and intuitive for human users to guide agent interactions in embodied environments. To this end, we develop ROCKET-2, a state-of-the-art agent trained in Minecraft that achieves a 3x to 6x improvement in inference efficiency. We show that ROCKET-2 can directly interpret goals from human camera views for the first time, paving the way for better human-agent interaction.
Our goal is to learn a goal-conditioned visuomotor policy that allows humans to specify goal objects for interaction via semantic segmentation across camera views. Formally, we aim to learn a policy \( \pi_{cross}(a_t \mid o_{1:t}, \{o_g, m_g\}, c_g) \), where \( a_t \) is the action at time \( t \), \( o_{1:t} \) is the agent's observation history, \( o_g \) is a goal-view observation (possibly from a human's camera), \( m_g \) is the segmentation mask of the goal object in \( o_g \), and \( c_g \) denotes the type of interaction.
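As a concrete illustration, the policy's conditioning inputs can be pictured with the following tensor shapes; all names and dimensions below are illustrative assumptions, not a released interface.

import torch

# Illustrative shapes for the inputs of pi_cross (hypothetical, for intuition only).
T, H, W = 16, 224, 224
obs_history = torch.zeros(T, 3, H, W)   # o_{1:t}: the agent's egocentric frames
goal_obs    = torch.zeros(3, H, W)      # o_g: one goal-view frame (e.g., a human's camera)
goal_mask   = torch.zeros(1, H, W)      # m_g: binary segmentation of the goal object in o_g
c_g         = torch.tensor(2)           # interaction-type id from an assumed vocabulary
# action = policy(obs_history, (goal_obs, goal_mask), c_g)   # a_t ~ pi_cross(...)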
To train such a visuomotor policy, we assume access to a dataset \(D_{cross} = \{c^n, (o_t^n, a_t^n, m_t^n)_{t=1}^{L(n)}\}_{n=1}^{N}\) consisting of \(N\) successful demonstration episodes, where \(L(n)\) is the length of episode \(n\) and \(m_t^n\) is the (possibly empty) segmentation mask of the target object in frame \(o_t^n\). Within each episode, all \((o_t, m_t)\) pairs with non-empty \(m_t\) refer to the same object. Consequently, we can arbitrarily pick one such frame as the goal-view condition \((o_g, m_g)\) for the entire trajectory.
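A minimal sketch of this goal-view selection, assuming each episode is stored as a list of per-frame records with a possibly-empty mask field (field names are hypothetical):

import random

def sample_goal_view(episode):
    # episode: list of dicts with keys "obs", "action", "mask" (mask is None when
    # the target is absent). Every non-empty mask refers to the same object, so
    # any such frame can serve as (o_g, m_g) for the whole trajectory.
    candidates = [step for step in episode if step["mask"] is not None]
    goal = random.choice(candidates)
    return goal["obs"], goal["mask"]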
ROCKET-2 Architecture. ROCKET-2 consists of three parts: (1) a non-causal transformer for spatial fusion, which establishes the relationship between the agent's and the human's camera views; (2) a causal transformer for temporal fusion, ensuring consistency for goal tracking; (3) a decoder module, made of a feedforward neural network (FFN), which predicts goal-related visual cues and actions.
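A condensed PyTorch sketch of these three parts is given below; layer counts, token pooling, and head sizes are assumptions for illustration, not the paper's hyperparameters.

import torch
import torch.nn as nn

class ROCKET2Sketch(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_actions=121, mask_patches=256):
        super().__init__()
        # (1) non-causal transformer: fuses current-view and goal-view patch tokens
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), num_layers=2)
        # (2) causal transformer: fuses per-frame features over time
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), num_layers=2)
        # (3) FFN decoder heads: actions plus goal-related visual cues
        self.action_head     = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, n_actions))
        self.visibility_head = nn.Linear(dim, 1)             # is the target visible now?
        self.mask_head       = nn.Linear(dim, mask_patches)  # coarse current-view goal mask

    def forward(self, frame_tokens, goal_tokens):
        # frame_tokens: (B, T, P, D) patch tokens per frame; goal_tokens: (B, P, D)
        B, T, P, D = frame_tokens.shape
        fused = []
        for t in range(T):  # spatial fusion per timestep, conditioned on the goal view
            tokens = torch.cat([frame_tokens[:, t], goal_tokens], dim=1)
            fused.append(self.spatial(tokens)[:, 0])         # first token as frame summary
        h = torch.stack(fused, dim=1)                        # (B, T, D)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.temporal(h, mask=causal)
        return self.action_head(h), self.visibility_head(h), self.mask_head(h)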
Cross-View Dataset Generation. We employ the backward trajectory relabeling technique proposed in ROCKET-1: starting from the frame where an interaction event occurs, the target object's segmentation mask is propagated backward through the preceding frames of the trajectory.
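In outline, and with a hypothetical tracker.propagate standing in for a video segmentation model such as SAM, the relabeling looks like this:

def backward_relabel(frames, event_idx, event_mask, tracker):
    # Segment the target at the interaction event, then walk backward in time,
    # propagating the mask to earlier frames until the target is lost.
    masks = [None] * len(frames)
    masks[event_idx] = event_mask
    for t in range(event_idx - 1, -1, -1):
        masks[t] = tracker.propagate(frames[t], masks[t + 1])
        if masks[t] is None:   # target no longer trackable this far back
            break
    return masks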
Cross-View Consistency Loss. We observe that relying solely on the behavior cloning loss fails to align the agent's egocentric view with the human-provided goal view. We therefore introduce a cross-view consistency loss, which supervises the decoder to predict the goal object's segmentation mask in the agent's current view, forcing the spatial transformer to ground the goal object across views.
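A minimal sketch of this term, assuming the decoder emits per-pixel logits for the goal object's footprint in the current view and the relabeled mask \(m_t\) serves as ground truth:

import torch.nn.functional as F

def cross_view_consistency_loss(pred_mask_logits, gt_mask):
    # pred_mask_logits, gt_mask: (B, H, W); gt_mask is the binary relabeled mask m_t
    return F.binary_cross_entropy_with_logits(pred_mask_logits, gt_mask.float())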
Target Visibility Loss. Due to partial observability in 3D environments, it is common for target objects in interaction trajectories to disappear from the field of view and reappear later. We propose training the model to predict whether the target object is currently visible.
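A sketch of this auxiliary term, with the per-step visibility label derived from whether the relabeled mask is empty (names are illustrative):

import torch.nn.functional as F

def target_visibility_loss(vis_logits, masks):
    # vis_logits: (B, T); masks: (B, T, H, W), all-zero when the target is out of view
    visible = (masks.flatten(2).sum(-1) > 0).float()   # 1 if mask non-empty at step t
    return F.binary_cross_entropy_with_logits(vis_logits, visible)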
Automated evaluation of the ROCKET agents relies on Molmo and SAM to generate a segmentation mask for the target object in the given views. ROCKET-1 (R1) requires object masks for all agent observations, whereas ROCKET-2 (R2) needs only one or a few object masks. While increasing the interaction frequency with Molmo improves ROCKET-1's performance, it incurs high inference time.
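The evaluation loop in outline; point_to_object (Molmo) and segment_at_point (SAM) are hypothetical wrappers rather than those models' real APIs:

def evaluate_rocket2(env, policy, task_prompt, point_to_object, segment_at_point):
    obs = env.reset()
    point = point_to_object(obs, task_prompt)             # Molmo: locate the target once
    goal_view, goal_mask = obs, segment_at_point(obs, point)  # SAM: one goal-view mask
    history, done, info = [], False, {}
    while not done:
        history.append(obs)
        action = policy(history, (goal_view, goal_mask))  # the single goal view is reused
        obs, reward, done, info = env.step(action)
    return info.get("success", False)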
Performance-Efficiency Comparison on the Minecraft Interaction Benchmark. We demonstrate that ROCKET-2 significantly improves inference speed while maintaining high interaction success rates, following the evaluation protocol of ROCKET-1.
Case Study of Human-Agent Interaction. We present two case studies illustrating how ROCKET-2 interprets human intent under the cross-view goal specification interface.
Visualization Analysis of Cross-View Alignment. Prominent non-goal objects, referred to as "landmarks", play a crucial role in helping humans or agents localize goal objects within a scene. We prepare a current-view observation and a third view with goal segmentation, and inspect the softmax-normalized attention map of the first self-attention layer in the spatial transformer. For each patch \(i\) in the current view, the map is overlaid on the third (goal) view to show which goal-view regions respond to that patch. Our findings reveal that ROCKET-2 establishes cross-view correspondences effectively, even under significant geometric deformations and distance variations.
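A sketch of how such a map can be extracted, assuming the first spatial layer's softmax attention weights have been captured with a forward hook (the token layout is an assumption):

import torch

def goal_view_heatmap(attn, num_cur_patches, patch_i, grid=16):
    # attn: (heads, P_cur + P_goal, P_cur + P_goal) softmax weights from layer 1
    row = attn.mean(dim=0)[patch_i]        # average heads, pick current-view patch i
    goal_part = row[num_cur_patches:]      # responses over goal-view tokens
    return goal_part.reshape(grid, grid)   # heatmap to overlay on the goal view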
Cross-Episode Generalization. We observe that ROCKET-2 exhibits cross-episode generalization capabilities. As shown in Figure 6, the selected goal views come from different episodes, each generated with a unique world seed.
To improve human-agent interaction in embodied worlds, we propose a cross-view goal specification approach. Since behavior cloning alone fails to align the agent's view with human views, we introduce cross-view consistency and target visibility losses to enhance alignment. ROCKET-2 achieves state-of-the-art performance on the Minecraft Interaction Benchmark with a 3x to 6x efficiency boost. Visualizations and case studies validate our method.
@misc{cai2025rocket2,
title={ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment},
author={Shaofei Cai and Zhancun Mu and Anji Liu and Yitao Liang},
year={2025},
eprint={2503.02505},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.02505},
}