Benchmark#

This tutorial gives an overview of the codebase used to automate and batch-test tasks in the MineStudio benchmark. It covers the framework's structure, purpose, and main functionality.

MineStudio Benchmark

Quick benchmark
- Code structure
- Workflow Overview
  - Task Configuration
  - Running Tests
    - An Example: test.py
Automatic Evaluation Pipeline
- Code Structure
- Evaluating Videos with Vision-Language Models

Overview#

The MineStudio benchmark is a comprehensive framework for evaluating agent performance across a wide range of Minecraft-based tasks. It offers the following key features:

- Diverse Task Support: Evaluate agents on tasks such as building, mining, crafting, collecting, and more.
- Game Mode Variability: Includes both simple and hard game modes to test agents under varying levels of difficulty.
- Batch Task Execution: Run multiple tasks simultaneously and record task completion videos for analysis.
- VLM-Based Evaluation: Leverage Vision-Language Models to analyze and score task videos.

How to Use#

Run Batch Tests:
- Use test_pipeline.py or test.py to execute tasks.
- Ensure your environment supports GPU acceleration for optimal performance.
Analyze Results:
- Review generated videos and metrics in the eval_video folder.
- Use criteria files to score and validate task completion.
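
As a starting point for reviewing results, the sketch below indexes the recorded videos so you can see at a glance which tasks produced output. It is not part of MineStudio: the helper name `index_videos`, the `.mp4` suffix, and the one-subfolder-per-task layout under eval_video are all assumptions made for illustration, so adjust them to match what your runs actually write.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: videos are saved as .mp4 files somewhere under eval_video/,
# with one subfolder per task. Adjust both to match your actual output.
VIDEO_SUFFIX = ".mp4"


def index_videos(root):
    """Group recorded videos under `root` by the name of their parent folder."""
    videos_by_task = defaultdict(list)
    # rglob walks the tree recursively; sorting keeps the listing stable.
    for path in sorted(Path(root).rglob(f"*{VIDEO_SUFFIX}")):
        videos_by_task[path.parent.name].append(path)
    return dict(videos_by_task)


if __name__ == "__main__":
    # Hypothetical usage: print how many videos each task folder contains.
    for task, videos in index_videos("eval_video").items():
        print(f"{task}: {len(videos)} video(s)")
```

A task that is missing from the listing, or that has fewer videos than expected, is a quick signal that a batch run did not finish for that task.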
