Quick benchmark#

This document describes the structure and workflow of the benchmark module, a framework for evaluating agent performance in Minecraft environments.

Code structure#

Below is the structure of the benchmark module, which organizes task definitions and testing scripts for evaluation:

benchmark/
    ├── task_configs/
    │   ├── simple/
    │   │   └── Task definitions for simple tasks.
    │   └── hard/
    │       └── Task definitions for complex tasks.
    ├── test_pipeline.py
    │   └── Example script for parallelized and batched task execution.
    ├── test.py
    │   └── Example script for running individual or small-scale tests.
    └── utility/
        └── Functionality for input reading and callback features.

Workflow overview#

Task configuration#

Tasks are defined in YAML files in the task_configs/ directory, under the appropriate difficulty subdirectory (e.g., simple/ or hard/). Example YAML:

custom_init_commands: 
- /give @s minecraft:water_bucket 3
- /give @s minecraft:stone 64
- /give @s minecraft:dirt 64
- /give @s minecraft:shovel{Enchantments:[{id:"minecraft:efficiency",lvl:1}]} 1
text: Build a waterfall in your Minecraft world.

Key elements of the YAML file (a parsing sketch follows this list):

  1. custom_init_commands:

    • Specifies commands to initialize the Minecraft environment for the task.

    • Examples:

      • /give @s minecraft:water_bucket 3: Gives the agent three water buckets.

      • /give @s minecraft:stone 64: Provides a stack of stone blocks.

    • These commands ensure the agent has the necessary tools and resources to perform the task.

  2. text:

    • Provides a natural language description of the task.

    • Example: "Build a waterfall in your Minecraft world."
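
To make the YAML-to-callback mapping concrete, below is a minimal sketch of how such a file could be parsed. The real loader is convert_yaml_to_callbacks in the utility/ package; the function name and return types here are illustrative assumptions, not the actual implementation.

import yaml

def parse_task_yaml(path):
    # Illustrative sketch only -- the real loader is
    # convert_yaml_to_callbacks in utility/, whose return types may differ.
    with open(path) as f:
        cfg = yaml.safe_load(f)
    commands = cfg.get("custom_init_commands", [])  # fed to CommandsCallback
    task_text = cfg.get("text", "")                 # fed to TaskCallback
    return commands, task_text

commands, task_text = parse_task_yaml("./task_configs/simple/build_waterfall.yaml")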

Running tests#

  1. Individual or Small-Scale Tests:

    • Use test.py for running specific tasks or testing new configurations.

      $ python test.py
      
  2. Batch Testing with Parallelization:

    • Use test_pipeline.py for executing tasks in parallel (see the sketch below).

      $ python test_pipeline.py
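
The internals of test_pipeline.py are not reproduced here; the sketch below is a generic illustration of the batched pattern (one worker process per task) using Python's multiprocessing, where run_task is a hypothetical helper wrapping the single-task flow shown in the next section.

from multiprocessing import Pool
from pathlib import Path

def run_task(yaml_path):
    # Hypothetical helper: build the environment and policy for one task
    # (the single-task flow from test.py below) and run it to completion.
    ...
    return yaml_path

if __name__ == "__main__":
    tasks = [str(p) for p in Path("./task_configs/simple").glob("*.yaml")]
    with Pool(processes=4) as pool:  # one Minecraft instance per worker
        for done in pool.imap_unordered(run_task, tasks):
            print(f"finished {done}")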
      

An example: test.py#

This script demonstrates how to evaluate tasks using YAML-based configurations. Below is an outline of its workflow:

  1. Task Setup:

    • Load configuration files from task_configs/simple.

    • Parse YAML files into callbacks using convert_yaml_to_callbacks.

  2. Environment Initialization:

    • Use MinecraftSim to create a simulation environment.

    • Add callbacks:

      • RecordCallback: Saves video frames for evaluation.

      • CommandsCallback: Runs the custom_init_commands to initialize the environment.

      • TaskCallback: Implements task-specific behavior.

  3. Task Execution:

    • Reset the environment and run the task for multiple steps.

    • Save observations, actions, and outputs for analysis.

  4. Result Storage:

    • Videos and logs are saved in the output/ directory.
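
Putting these steps together, the core of the script looks like this: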

# Import paths below are assumptions based on the package layout shown above
# (convert_yaml_to_callbacks comes from the utility/ package); adjust them
# to your installation.
from minestudio.simulator import MinecraftSim
from minestudio.simulator.callbacks import RecordCallback, CommandsCallback, TaskCallback
from minestudio.models import load_vpt_policy
from minestudio.benchmark.utility.read_conf import convert_yaml_to_callbacks

# Parse the task YAML into init-command and task-description callbacks.
commands_callback, task_callback = convert_yaml_to_callbacks(
    "./task_configs/simple/build_waterfall.yaml"
)
env = MinecraftSim(
    obs_size=(128, 128),
    callbacks=[
        RecordCallback(record_path="./output/", fps=30, frame_type="pov"),
        CommandsCallback(commands_callback),  # runs custom_init_commands on reset
        TaskCallback(task_callback),          # attaches the task description
    ]
)
policy = load_vpt_policy(
    model_path="/nfs-shared/jarvisbase/pretrained/foundation-model-2x.model",
    weights_path="/nfs-shared/jarvisbase/pretrained/foundation-model-2x.weights"
).to("cuda")

obs, info = env.reset()
memory = None  # recurrent policy state; None initializes it on the first call
for _ in range(12000):
    action, memory = policy.get_action(obs, memory, input_shape='*')
    obs, reward, terminated, truncated, info = env.step(action)
env.close()