Built-in Models: STEVE-1#

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

Quick Facts

STEVE-1 [1] finetunes VPT to follow short-horizon, open-ended text and visual instructions without the need for costly human annotations.

Insights#

Pre-trained foundation models demonstrate a surprising ability to be efficiently fine-tuned into instruction-following models. In sequential decision-making domains, two foundation models for Minecraft have been released: VPT [2] and MineCLIP [3], opening intriguing possibilities for fine-tuning instruction-aware decision-making agents.

The authors draw insights from unCLIP [4] to propose a two-stage learning framework for training STEVE-1, eliminating the need for laborious human annotations.

Method#

../_images/steve.png

STEVE-1 architecture. Image credit: [1]#

To create a policy in Minecraft conditioned on text instructions \(y\), they utilize a dataset of (partially) annotated trajectories \([(\tau_1, y_1), (\tau_2, y_2), \dots, (\tau_n, \emptyset)]\). They employ MineCLIP, which is capable of generating aligned latents \(z_{\tau_{t:t+16}}\) and \(z_y\), where \(z_{\tau_{\text{goal}}} = z_{\tau_{t:t+16}}\) is an embedding of 16 consecutive frames.
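As a minimal sketch of this setup: the `encode_video` and `encode_text` functions below are hypothetical stand-ins for the MineCLIP video and text encoders (not its actual API), and the 512-dimensional embedding size is an assumption about the shared aligned space.

```python
import torch

# Hypothetical wrappers around the MineCLIP video and text encoders; both map into
# the same aligned embedding space (assumed 512-dimensional here).
def encode_video(clip: torch.Tensor) -> torch.Tensor:
    """(16, 3, H, W) clip of consecutive frames -> (512,) latent z_{tau_{t:t+16}}."""
    return torch.randn(512)  # placeholder for the real MineCLIP video encoder

def encode_text(instruction: str) -> torch.Tensor:
    """Instruction string -> (512,) latent z_y in the same space."""
    return torch.randn(512)  # placeholder for the real MineCLIP text encoder

clip = torch.rand(16, 3, 160, 256)        # tau_{t:t+16}: 16 consecutive frames
z_goal = encode_video(clip)               # z_{tau_goal}
z_y = encode_text("chop down a tree")     # z_y (absent for unlabeled trajectories)
```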

The instruction-following model is composed of a policy and a prior:

\[p(\tau | y) = p(\tau, z_{\tau_{\text{goal}}} | y) = p(\tau | z_{\tau_{\text{goal}}} ) p(z_{\tau_{\text{goal}}} | y),\]

where the policy generates a trajectory \(\tau\) conditioned on the aligned latents \(z_{\tau_{\text{goal}}}\) and the prior generates \(z_{\tau_{\text{goal}}}\) conditioned on the instruction \(y\).
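A hedged sketch of how this factorization is used at inference time; both components below are illustrative placeholders, not the released STEVE-1 implementation.

```python
import torch

def prior_sample(z_y: torch.Tensor) -> torch.Tensor:
    """p(z_goal | y): map an instruction embedding to a visual goal embedding (placeholder)."""
    return z_y + 0.1 * torch.randn_like(z_y)

def policy_logits(obs: torch.Tensor, z_goal: torch.Tensor) -> torch.Tensor:
    """p(tau | z_goal): goal-conditioned action logits for one step (placeholder)."""
    return torch.randn(8)

z_y = torch.randn(512)            # MineCLIP embedding of the text instruction
z_goal = prior_sample(z_y)        # stage 1: the prior turns the instruction into a visual goal
obs = torch.rand(3, 160, 256)     # current observation
# stage 2: the goal-conditioned policy is rolled out step by step to produce tau
action = int(torch.distributions.Categorical(logits=policy_logits(obs, z_goal)).sample())
```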

To train the policy, they use a modification of hindsight relabeling to generate goals for each trajectory:

../_images/steve_hindsight.png

They randomly select timesteps from episodes and use hindsight relabeling to set the intermediate goals of the trajectory segments to the visual MineCLIP embeddings at those timesteps. Image credit: [1]#

By finetuning VPT on this dataset, the policy learns to reach given goal states (visual goals).
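A minimal sketch of this relabeling step, assuming an `encode_video` wrapper for the MineCLIP video encoder and an arbitrary number of sampled goal timesteps per episode; the function name and segmentation details are illustrative, not the paper's exact procedure.

```python
import random
import torch

def hindsight_relabel(frames: torch.Tensor, encode_video, n_goals: int = 4):
    """Pick random future timesteps in an episode and label each preceding segment
    with the MineCLIP embedding of the 16 frames ending at that timestep.

    frames: (T, 3, H, W) episode. Returns a list of (segment_slice, z_goal) pairs.
    """
    T = frames.shape[0]
    goal_steps = sorted(random.sample(range(16, T), k=min(n_goals, T - 16)))
    pairs, start = [], 0
    for t in goal_steps:
        z_goal = encode_video(frames[t - 16 : t])   # visual goal: embedding of a future clip
        pairs.append((slice(start, t), z_goal))     # segment relabeled with its own future
        start = t
    return pairs

episode = torch.rand(200, 3, 160, 256)
pairs = hindsight_relabel(episode, encode_video=lambda clip: torch.randn(512))
```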

To also learn to follow text instructions, they train a conditional variational autoencoder (CVAE) with Gaussian prior and posterior to translate a text embedding \(z_y\) into a visual embedding \(z_{\tau_{\text{goal}}}\). The training objective is a standard ELBO loss:

\[\mathcal{L}_{\text{prior}}(\phi) = \mathbb{E}_{(z_{\tau_{\text{goal}}}, z_y) \sim \mathcal{D}_{\text{labels}}} \left[ \text{KL}\!\left(q_{\phi}(c \mid z_{\tau_{\text{goal}}}, z_y) \,\|\, p(c)\right) - \mathbb{E}_{c \sim q_\phi(c \mid z_{\tau_{\text{goal}}}, z_y)}\left[\log p_\phi(z_{\tau_{\text{goal}}} \mid c, z_y)\right] \right].\]
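A minimal sketch of such a CVAE prior trained with this ELBO; the layer sizes, latent dimension, and mean-squared reconstruction term (a Gaussian log-likelihood up to a constant) are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAEPrior(nn.Module):
    """Conditional VAE translating a text embedding z_y into a visual goal embedding z_goal."""

    def __init__(self, embed_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * embed_dim, 512), nn.ReLU(),
                                     nn.Linear(512, 2 * latent_dim))   # -> (mu, log_var) of q(c | z_goal, z_y)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + embed_dim, 512), nn.ReLU(),
                                     nn.Linear(512, embed_dim))        # -> reconstructed z_goal

    def forward(self, z_goal: torch.Tensor, z_y: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.encoder(torch.cat([z_goal, z_y], -1)).chunk(2, -1)
        c = mu + torch.randn_like(mu) * (0.5 * log_var).exp()          # reparameterized sample of c
        recon = self.decoder(torch.cat([c, z_y], -1))
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()  # KL(q || N(0, I))
        recon_loss = F.mse_loss(recon, z_goal)                         # -log p(z_goal | c, z_y) up to a constant
        return recon_loss + kl

prior = CVAEPrior()
loss = prior(torch.randn(32, 512), torch.randn(32, 512))
loss.backward()
```

At inference time, a visual goal embedding is generated by sampling \(c\) from the standard Gaussian prior \(p(c)\) and decoding it together with the instruction embedding \(z_y\).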

They utilize classifier-free guidance when training the policy: the goal embedding is occasionally dropped out during training. During inference, they compute a combination of the logits with and without goal conditioning to generate the final trajectory.
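A minimal sketch of this inference-time logit combination, using one common classifier-free-guidance parametrization; the `policy` callable, the guidance scale `lam`, and the `None` sentinel for the dropped-out goal are illustrative assumptions rather than the released interface.

```python
import torch

def guided_logits(policy, obs: torch.Tensor, z_goal: torch.Tensor, lam: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push goal-conditioned logits away from the unconditional ones."""
    cond = policy(obs, z_goal)    # logits with the goal embedding
    uncond = policy(obs, None)    # logits with the goal dropped out
    return uncond + lam * (cond - uncond)

# Toy goal-conditioned policy, just to make the sketch runnable.
toy_policy = lambda obs, z_goal: torch.randn(8)
logits = guided_logits(toy_policy, obs=torch.rand(3, 160, 256), z_goal=torch.randn(512))
action = int(torch.distributions.Categorical(logits=logits).sample())
```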

Citations#