Automatic Evaluation Pipeline#

The pipeline automates evaluation tasks in the MineStudio framework, enabling criteria generation and video-based evaluation for agent performance analysis.


Code Structure#

The following is the structure of the evaluation module, with each file and folder serving specific purposes:


auto_eval/
    ├── criteria_files/
    │   └── Contains criteria files for evaluating videos for each task.
    ├── eval_video/
    │   └── Stores example videos and provides a structure for saving task-specific evaluation videos.
    ├── batch_video_rating.py
    │   └── Batch evaluation of videos for task performance.
    ├── individual_video_rating.py
    │   └── Evaluate videos individually for detailed analysis.
    └── video_comparison.py
        └── Compare videos to measure performance differences.

Evaluating Videos with Vision-Language Models#

An Example: Comparing Videos Using video_comparison.py#

Below is a simplified guide to comparing two videos with video_comparison.py:

  1. Prepare Videos: Ensure your video files (e.g., video_a.mp4 and video_b.mp4) are placed in the eval_video/ directory.

  2. Define Criteria: Provide task-specific criteria files in criteria_files/ (e.g., build_gate.txt).

  3. Run the Script: Use the following command to compare two videos:

    python video_comparison.py \
      --video_path_a='./eval_video/build_gate/build_gate_5.mp4' \
      --video_path_b='./eval_video/build_gate/build_gate_7.mp4' \
      --criteria_path='./auto_eval/criteria_files/build_gate.txt'
    
  4. Analyze Results: After running the script, the evaluation results will be saved as a JSON file in the vlm_rating_res/ directory.

The following is an example output, showcasing how two videos are compared across several evaluation criteria. Each criterion is explained with observations, and an overall assessment is provided.

[
    {
        "Task Progress": "B is better",
        "Action Control": "B is better",
        "Error Recognition and Correction": "B is better",
        "Creative Attempts": "tie",
        "Task Completion Efficiency": "B is better",
        "Material Selection and Usage": "tie",
        "video_1_path": "./eval_video/build_gate_5.mp4",
        "video_2_path": "./eval_video/build_gate_7.mp4"
    },
    "Task Progress:\n- Video B constructs two pillars and an arch; A does not complete the arch.\nresult: B is better\n\nAction Control:\n- Video A shows more wandering and redundant actions.\nresult: B is better\n\nError Recognition and Correction:\n- Video B corrects structure misalignments.\nresult: B is better\n\nCreative Attempts:\n- Neither video shows creative elements like decorations.\nresult: tie\n\nTask Completion Efficiency:\n- Video B completes the task faster and more efficiently.\nresult: B is better\n\nMaterial Selection and Usage:\n- Both use oak planks appropriately.\nresult: tie\n"
]
Key aspects of this output:

  1. Assessment Dimensions:

    • Task Progress: Measures how much of the task is completed.

    • Action Control: Assesses movement precision and avoidance of redundant actions.

    • Error Recognition and Correction: Evaluates the agent’s ability to detect and fix mistakes.

    • Creative Attempts: Considers innovative or decorative efforts beyond task requirements.

    • Task Completion Efficiency: Tracks speed and resourcefulness in completing the task.

    • Material Selection and Usage: Ensures appropriate materials are used.

  2. Structured Results:

    • The first section provides a concise summary of the evaluation for each criterion.

    • Example:

      • "Task Progress": "B is better"

      • "Creative Attempts": "tie"

  3. Detailed Observations:

    • The second section explains the reasoning behind each result.

    • Example:

      • Task Progress: “Video B constructs two pillars and an arch; A does not complete the arch.”

      • Creative Attempts: “Neither video shows creative elements like decorations.”
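
Because the results are saved as plain JSON, they can also be post-processed programmatically. The snippet below is a minimal sketch that loads one comparison result and prints the verdict for each criterion; the file name vlm_rating_res/build_gate_comparison.json is assumed for illustration only, since the exact naming scheme is decided by video_comparison.py.

import json
from pathlib import Path

# Hypothetical result file name; the actual name is chosen by video_comparison.py.
result_file = Path("./vlm_rating_res/build_gate_comparison.json")

# The file holds a two-element list: a verdict dictionary and a reasoning string.
summary, observations = json.loads(result_file.read_text())

# Print the per-criterion verdicts, skipping the bookkeeping path entries.
for criterion, verdict in summary.items():
    if criterion.startswith("video_"):
        continue
    print(f"{criterion}: {verdict}")

# Free-form reasoning behind each verdict.
print(observations)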

Organizing Files for Batch Evaluation#

Evaluate all videos in a directory using their respective criteria.

python batch_video_rating.py \
  --videos_path='./eval_video/' \
  --criteria_files_path='./auto_eval/criteria_files/'

Organize your task-specific videos under the videos_path directory:

videos_path
├── build_waterfall         # task_name_1
│   ├── episode_1.mp4
│   └── episode_2.mp4
├── build_house             # task_name_2
│   ├── episode_1.mp4
│   └── episode_2.mp4
└── task_name_3
    ├── episode_1.mp4
    └── episode_2.mp4

Store criteria files under the criteria_files_path directory, matching the task names:

criteria_files_path
├── build_waterfall.txt     # task_name_1
├── build_house.txt         # task_name_2
└── task_name_3.txt
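
Since batch evaluation pairs each task folder with the criteria file of the same name, it can be worth verifying the layout before starting a long run. The following is a minimal standalone sketch of such a check, written for this tutorial rather than taken from the MineStudio codebase; adjust the paths to your setup.

from pathlib import Path

# Illustrative helper, not part of MineStudio: check that every task folder
# under videos_path has a matching <task_name>.txt under criteria_files_path.
videos_path = Path("./eval_video")
criteria_files_path = Path("./auto_eval/criteria_files")

for task_dir in sorted(p for p in videos_path.iterdir() if p.is_dir()):
    criteria_file = criteria_files_path / f"{task_dir.name}.txt"
    videos = sorted(task_dir.glob("*.mp4"))
    status = "criteria found" if criteria_file.exists() else "criteria file missing"
    print(f"{task_dir.name}: {len(videos)} video(s), {status}")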

Example Commands#

To evaluate task performance using pre-recorded videos and criteria, you can use the following commands depending on your needs:

  • Compare Two Videos:

    Compare two videos of the same task to analyze differences in agent performance.

    python video_comparison.py \
      --video_path_a='./eval_video/build_gate/build_gate_5.mp4' \
      --video_path_b='./eval_video/build_gate/build_gate_7.mp4' \
      --criteria_path='./auto_eval/criteria_files/build_gate.txt'
    
  • Individual Video Evaluation:

    Evaluate a single video against predefined criteria.

    python individual_video_rating.py \
      --video_path='./eval_video/build_gate/build_gate_5.mp4' \
      --criteria_path='./auto_eval/criteria_files/build_gate.txt'
    
  • Batch Video Evaluation:

    Evaluate all videos in a directory using their respective criteria.

    python batch_video_rating.py \
      --videos_path='./eval_video/' \
      --criteria_files_path='./auto_eval/criteria_files/'
    
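For a round-robin comparison of all videos recorded for a single task, video_comparison.py can also be driven from a small wrapper. The sketch below shells out to the script with itertools.combinations, using only the command-line options shown above; the wrapper itself is illustrative and not a built-in MineStudio utility.

import itertools
import subprocess
from pathlib import Path

task_dir = Path("./eval_video/build_gate")
criteria_path = Path("./auto_eval/criteria_files/build_gate.txt")

# Compare every unordered pair of videos recorded for the task.
for video_a, video_b in itertools.combinations(sorted(task_dir.glob("*.mp4")), 2):
    subprocess.run(
        [
            "python", "video_comparison.py",
            f"--video_path_a={video_a}",
            f"--video_path_b={video_b}",
            f"--criteria_path={criteria_path}",
        ],
        check=True,
    )
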

This tutorial covers the essentials of setting up and running the automatic evaluation pipeline. For more advanced usage, explore the provided code files for customization options.