MCU (Minecraft Universe) is a comprehensive evaluation framework for benchmarking AI agents in open-ended environments. Built on Minecraft, MCU enables systematic and scalable assessment through two components:

- Task generation: atomic tasks are sourced from the Minecraft Wiki, in-game data, existing benchmarks (MineDojo, SkillForge), and LLM-augmented brainstorming. An LLM-based config generator then dynamically creates executable task setups, which are verified through self-reflection.
- Agent evaluation: agent trajectories are recorded as videos and analyzed by a VLM (GPT-4o), which scores performance across six dimensions: task progress, action control, material usage, efficiency, error recognition, and creativity.
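To make the evaluation loop concrete, here is a minimal sketch of how a VLM grader along these lines could be wired up. The rubric wording, the `score_trajectory` helper, and the frame-sampling details are illustrative assumptions, not the authors' implementation; only the six dimensions come from the framework itself.

```python
import base64
import json

from openai import OpenAI

# The six MCU scoring dimensions, as described above.
DIMENSIONS = [
    "task_progress", "action_control", "material_usage",
    "efficiency", "error_recognition", "creativity",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_trajectory(task_description: str, frame_paths: list[str]) -> dict:
    """Hypothetical grader: show sampled trajectory frames to GPT-4o
    and ask for a 0-10 score on each dimension, returned as JSON."""
    images = []
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        images.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })

    prompt = (
        f"The agent was asked to: {task_description}\n"
        f"Rate the trajectory from 0 to 10 on each of: {', '.join(DIMENSIONS)}. "
        "Reply with a JSON object mapping each dimension to a score."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": prompt}, *images],
        }],
    )
    return json.loads(response.choices[0].message.content)
```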
In a large-scale evaluation of 90 atomic and 60 composite tasks, we observed that while foundation agents such as GROOT, STEVE-1, and VPT-RL perform adequately on simple tasks, their success rates drop sharply on more complex tasks or under slight environmental changes. For instance, in the “sleep in bed” task, moving the bed from outdoors into a furnished room caused GROOT to misidentify the bed or wander away, indicating a lack of spatial understanding and generalization. A sketch of how such a perturbation might be expressed follows below.
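The “sleep in bed” example suggests how environment perturbations can be encoded as task-config variants: the same goal paired with different initial scenes. The schema below is purely hypothetical (the field names are not MCU's actual format) and only illustrates the idea.

```python
# Hypothetical task configs: same goal, two environment variants.
# Field names are illustrative, not MCU's actual schema.
SLEEP_IN_BED_BASE = {
    "task_id": "sleep_in_bed",
    "goal": "Sleep in the bed.",
    "environment": {
        "setting": "outdoors",           # bed placed on open ground
        "spawn_objects": [{"type": "bed", "pos": "near_agent"}],
        "time": "night",
    },
}

SLEEP_IN_BED_FURNISHED = {
    **SLEEP_IN_BED_BASE,
    "task_id": "sleep_in_bed_furnished",
    "environment": {
        "setting": "furnished_room",     # bed among distractor furniture
        "spawn_objects": [
            {"type": "bed", "pos": "against_wall"},
            {"type": "crafting_table", "pos": "corner"},
            {"type": "chest", "pos": "corner"},
        ],
        "time": "night",
    },
}
```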
Moreover, in tasks requiring creativity and error recognition (e.g., construction or combat), agents lag significantly behind humans: they fail to adapt or revise their strategies based on feedback, highlighting a key gap between current AI systems and human-level reasoning.
@inproceedings{zheng2025mcu,
  title     = {MCU: An Evaluation Framework for Open-Ended Game Agents},
  author    = {Zheng, Xinyue and Lin, Haowei and He, Kaichen and Wang, Zihao and Zheng, Zilong and Liang, Yitao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2310.08367}
}