MCU: An Evaluation Framework for Open-Ended Game Agents

Introduction

MCU (Minecraft Universe) is a comprehensive evaluation framework designed to benchmark AI agents in open-ended environments. Built on Minecraft, MCU enables systematic and scalable assessment through automated task generation and VLM-based automatic evaluation (AutoEval).

MCU addresses both inter-task and intra-task diversity, reflecting real-world variations in environment conditions. It highlights critical challenges in control precision, knowledge usage, and adaptive behavior for modern foundation agents.
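
To make intra-task diversity concrete, the sketch below shows how a single atomic task can be instantiated under different environment conditions. The field names and config schema are illustrative assumptions, not MCU's actual task format.

# A minimal sketch, assuming a hypothetical TaskConfig schema;
# MCU's real task configs may differ.
from dataclasses import dataclass, field

@dataclass
class TaskConfig:
    """One atomic task, parameterized by environment conditions."""
    name: str
    instruction: str
    difficulty: str          # e.g. "simple" or "hard"
    environment: dict = field(default_factory=dict)

# The same "sleep in bed" task under two environment conditions
# (intra-task diversity): a bed outdoors vs. a bed inside a furnished room.
variants = [
    TaskConfig("sleep_in_bed", "Find the bed and sleep in it.", "simple",
               {"bed_location": "outdoors", "time_of_day": "night"}),
    TaskConfig("sleep_in_bed", "Find the bed and sleep in it.", "hard",
               {"bed_location": "furnished_room", "time_of_day": "night"}),
]

for cfg in variants:
    print(cfg.name, cfg.difficulty, cfg.environment)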

Example Tasks

Build Nether Portal
Build Snow Golem
Build Waterfall
Explore World
Mine Diamond
Find Village
Hunt Sheep
Dig 3 blocks and Fill 1 block
Craft Clock
Make Fire
Make Obsidian
Combat Enderman
Carve Pumpkin
Drink Harming Potion
Prepare Birthday Gift
Dye & Shear Sheep

Benchmarking Process

The MCU framework consists of two stages, task generation and agent evaluation, ensuring scalable and high-quality benchmarking. For task generation, atomic tasks are sourced from the Minecraft Wiki, in-game data, existing benchmarks (MineDojo, SkillForge), and LLM-augmented brainstorming; an LLM-based config generator then dynamically creates executable task setups, which are verified through self-reflection. For agent evaluation, trajectories are recorded as videos and analyzed by a VLM (GPT-4o), which assesses performance across six key dimensions: task progress, action control, material usage, efficiency, error recognition, and creativity.
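
The sketch below illustrates the evaluation half of this pipeline: a rubric prompt covering the six dimensions is paired with sampled trajectory frames and handed to a VLM judge. It is a minimal sketch under assumptions; the prompt wording and the query_vlm helper are placeholders, not the framework's actual implementation or the GPT-4o API.

import json

# The six MCU scoring dimensions.
DIMENSIONS = [
    "task progress",
    "action control",
    "material usage",
    "efficiency",
    "error recognition",
    "creativity",
]

def build_rubric_prompt(task_instruction: str) -> str:
    """Compose a grading prompt asking the VLM judge to score each dimension."""
    criteria = "\n".join(
        f"- {d}: integer score 0-10 plus a one-sentence rationale" for d in DIMENSIONS
    )
    return (
        f"You are grading a Minecraft agent on the task: {task_instruction}.\n"
        f"Based on the attached gameplay frames, rate the agent on:\n{criteria}\n"
        "Return a JSON object mapping each dimension to {\"score\", \"rationale\"}."
    )

def query_vlm(prompt: str, frames: list) -> str:
    """Placeholder for a call to a vision-language model such as GPT-4o.
    A real pipeline would send the prompt and sampled video frames to the
    model API; here a dummy response keeps the sketch runnable."""
    return json.dumps({d: {"score": 0, "rationale": "stub"} for d in DIMENSIONS})

if __name__ == "__main__":
    prompt = build_rubric_prompt("Build a snow golem")
    scores = json.loads(query_vlm(prompt, frames=[]))
    for dim, result in scores.items():
        print(f"{dim}: {result['score']}/10")

A separate config-generation step would similarly prompt an LLM to emit an executable task setup and then re-prompt it to check its own output (the self-reflection step described above) before the task is registered.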

Experimental Results

In a large-scale evaluation of 90 atomic and 60 composite tasks, we observed that while foundation agents such as GROOT, STEVE-I, and VPT-RL perform adequately on simple tasks, their success rates drop sharply on more complex tasks or under slight changes to the environment. For instance, in the “sleep in bed” task, moving the bed from outdoors into a furnished room caused GROOT to misidentify the bed or leave the area, indicating a lack of spatial understanding and generalization.

Task variants: Simple (outdoor bed) vs. Hard (bed inside a furnished room).

Moreover, in tasks requiring creativity and error recognition (e.g., construction or combat), agents lag significantly behind humans. They fail to adapt or revise their strategies based on feedback, highlighting a key gap between current AI systems and human-level reasoning.

Agent performance results. Left: multi-dimensional capabilities; middle: performance degradation with increasing difficulty; right: human vs. AutoEval scores.
Human-AI score agreement: large-scale evaluation on 90 atomic tasks with AutoEval.

BibTeX

@inproceedings{zheng2025mcu,
  title     = {MCU: An Evaluation Framework for Open-Ended Game Agents},
  author    = {Zheng, Xinyue and Lin, Haowei and He, Kaichen and Wang, Zihao and Zheng, Zilong and Liang, Yitao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2310.08367}
}