We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose using reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. We derive a new learning framework that trains such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent, GROOT, with a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on our proposed Minecraft SkillForge benchmark. The Elo ratings show that GROOT is closing the human-machine gap and achieves a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further reveals emergent properties, including goal composition and the synthesis of complex gameplay behaviors.
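To give a rough sense of scale for the 70% winning rate quoted above, the sketch below converts a head-to-head win rate into a rating gap under the standard logistic Elo model (base 10, scale 400). This is an assumption for illustration only: the abstract does not say how the Elo ratings were computed, draws are ignored, and the function names `elo_expected_score` and `elo_gap_from_win_rate` are hypothetical.

```python
import math


def elo_expected_score(rating_diff: float) -> float:
    """Expected score of a player rated `rating_diff` points above the opponent.

    Uses the standard logistic Elo curve (assumed here, not taken from the paper).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def elo_gap_from_win_rate(win_rate: float) -> float:
    """Invert the Elo curve: rating gap implied by a head-to-head win rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# A 70% win rate corresponds to a gap of roughly 147 Elo points.
gap = elo_gap_from_win_rate(0.70)
print(round(gap))  # 147
```

Under these assumptions, a 70% winning rate corresponds to a gap of about 147 rating points. The ratings reported for the benchmark may come from a different procedure.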
@misc{cai2023groot,
title={{GROOT}: Learning to Follow Instructions by Watching Gameplay Videos},
author={Shaofei Cai and Bowei Zhang and Zihao Wang and Xiaojian Ma and Anji Liu and Yitao Liang},
year={2023},
eprint={2310.08235},
archivePrefix={arXiv},
primaryClass={cs.AI}
}