MCU: An Evaluation Framework for Open-Ended Game Agents

Introduction

MCU (Minecraft Universe) is a comprehensive evaluation framework designed to benchmark AI agents in open-ended environments. Built on Minecraft, MCU enables systematic and scalable assessment through automated task generation and VLM-based automatic evaluation (AutoEval).

MCU addresses both inter-task and intra-task diversity, reflecting real-world variations in environment conditions. It highlights critical challenges in control precision, knowledge usage, and adaptive behavior for modern foundation agents.
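
To make intra-task diversity concrete, the sketch below shows how a single atomic task can be instantiated under different environment conditions. The field names and config schema are illustrative assumptions, not MCU's actual task format.

# A minimal sketch, assuming a hypothetical TaskConfig schema;
# MCU's real task configs may differ.
from dataclasses import dataclass, field

@dataclass
class TaskConfig:
    """One atomic task, parameterized by environment conditions."""
    name: str
    instruction: str
    difficulty: str          # e.g. "simple" or "hard"
    environment: dict = field(default_factory=dict)

# The same "sleep in bed" task under two environment conditions
# (intra-task diversity): a bed outdoors vs. a bed inside a furnished room.
variants = [
    TaskConfig("sleep_in_bed", "Find the bed and sleep in it.", "simple",
               {"bed_location": "outdoors", "time_of_day": "night"}),
    TaskConfig("sleep_in_bed", "Find the bed and sleep in it.", "hard",
               {"bed_location": "furnished_room", "time_of_day": "night"}),
]

for cfg in variants:
    print(cfg.name, cfg.difficulty, cfg.environment)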

Example Tasks

Build Nether Portal
Build Snow Golem
Build Waterfall
Explore World
Mine Diamond
Find Village
Hunt Sheep
Dig 3 blocks and Fill 1 block
Craft Clock
Make Fire
Make Obsidian
Combat Enderman
Carve Pumpkin
Drink Harming Potion
Prepare Birthday Gift
Dye & Shear Sheep

Benchmarking Process

The MCU framework consists of two stages, task generation and agent evaluation, ensuring scalable and high-quality benchmarking. For task generation, atomic tasks are sourced from the Minecraft Wiki, in-game data, existing benchmarks (MineDojo, SkillForge), and LLM-augmented brainstorming; an LLM-based config generator then dynamically creates executable task setups, which are verified through self-reflection. For agent evaluation, trajectories are recorded as videos and analyzed by a VLM (GPT-4o), which assesses performance across six key dimensions: task progress, action control, material usage, efficiency, error recognition, and creativity.
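
The sketch below illustrates the evaluation half of this pipeline: a rubric prompt covering the six dimensions is paired with sampled trajectory frames and handed to a VLM judge. It is a minimal sketch under assumptions; the prompt wording and the query_vlm helper are placeholders, not the framework's actual implementation or the GPT-4o API.

import json

# The six MCU scoring dimensions.
DIMENSIONS = [
    "task progress",
    "action control",
    "material usage",
    "efficiency",
    "error recognition",
    "creativity",
]

def build_rubric_prompt(task_instruction: str) -> str:
    """Compose a grading prompt asking the VLM judge to score each dimension."""
    criteria = "\n".join(
        f"- {d}: integer score 0-10 plus a one-sentence rationale" for d in DIMENSIONS
    )
    return (
        f"You are grading a Minecraft agent on the task: {task_instruction}.\n"
        f"Based on the attached gameplay frames, rate the agent on:\n{criteria}\n"
        "Return a JSON object mapping each dimension to {\"score\", \"rationale\"}."
    )

def query_vlm(prompt: str, frames: list) -> str:
    """Placeholder for a call to a vision-language model such as GPT-4o.
    A real pipeline would send the prompt and sampled video frames to the
    model API; here a dummy response keeps the sketch runnable."""
    return json.dumps({d: {"score": 0, "rationale": "stub"} for d in DIMENSIONS})

if __name__ == "__main__":
    prompt = build_rubric_prompt("Build a snow golem")
    scores = json.loads(query_vlm(prompt, frames=[]))
    for dim, result in scores.items():
        print(f"{dim}: {result['score']}/10")

A separate config-generation step would similarly prompt an LLM to emit an executable task setup and then re-prompt it to check its own output (the self-reflection step described above) before the task is registered.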

Experimental Results

In a large-scale evaluation of 90 atomic and 60 composite tasks, we observed that while foundation agents such as GROOT, STEVE-I, and VPT-RL perform adequately on simple tasks, their success rates drop sharply on more complex tasks or under slight changes to the environment. For instance, in the “sleep in bed” task, moving the bed from outdoors into a furnished room caused GROOT to misidentify the bed or leave the area, indicating a lack of spatial understanding and generalization.

Task variants: Simple (outdoor bed) vs. Hard (bed inside a furnished room).

Moreover, in tasks requiring creativity and error recognition (e.g., construction or combat), agents lag significantly behind humans. They fail to adapt or revise their strategies based on feedback, highlighting a key gap between current AI systems and human-level reasoning.

Agent performance results. Left: multi-dimensional capabilities; middle: performance degradation with increasing difficulty; right: human vs. AutoEval scores.
Human-AI score agreement: large-scale evaluation on 90 atomic tasks with AutoEval.

BibTeX

@inproceedings{zheng2025mcu,
  title     = {MCU: An Evaluation Framework for Open-Ended Game Agents},
  author    = {Zheng, Xinyue and Lin, Haowei and He, Kaichen and Wang, Zihao and Zheng, Zilong and Liang, Yitao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2310.08367}
}