Open-World Skill Discovery from Unsegmented Demonstrations

1Peking University, 2University of California, Los Angeles

Abstract

Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long and unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in the Minecraft environment, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of two conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and of their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks.

Pipeline

Our method SBD for discovering skills from unsegmented demonstration videos consists of four stages:

Stage I: An unconditional Transformer-XL-based policy model is pretrained on an unsegmented dataset to predict future actions (labeled by an inverse dynamics model) from past observations.
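
As a rough sketch of this stage (not the paper's released code), the snippet below performs behavior cloning on IDM-labeled actions with a causal Transformer. A standard nn.TransformerEncoder with a causal mask stands in for the Transformer-XL backbone, and the feature dimensions, action-space size, and hyperparameters are illustrative placeholders.

```python
# Illustrative sketch of Stage I, not the released implementation.
# Assumptions: observations are pre-encoded into feature vectors, actions are
# flattened into one categorical head, and nn.TransformerEncoder (with a causal
# mask) replaces the Transformer-XL backbone used in the paper.
import torch
import torch.nn as nn

class UnconditionalPolicy(nn.Module):
    def __init__(self, obs_dim=512, d_model=256, n_actions=128, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, obs_feats):                       # (B, T, obs_dim)
        T = obs_feats.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T).to(obs_feats.device)
        h = self.backbone(self.embed(obs_feats), mask=causal_mask)
        return self.action_head(h)                      # (B, T, n_actions)

def pretrain_step(policy, optimizer, obs_feats, idm_actions):
    """One behavior-cloning step on actions labeled by an inverse dynamics model."""
    logits = policy(obs_feats)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), idm_actions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```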

Stage II: The pretrained unconditional policy produces a high action-prediction loss when it encounters uncertain observations in open worlds (e.g., deciding whether to kill a newly encountered sheep). We mark these timesteps as skill boundaries, since they indicate points where additional instructions are needed to control behavior, and cut the long unsegmented videos at these boundaries into a series of short atomic skill demonstrations.
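
A minimal sketch of this boundary rule follows, assuming a simple criterion in which a timestep is marked as a boundary when its prediction loss rises sharply above a recent running average; the exact criterion, window, and thresholds used in the paper may differ.

```python
# Illustrative boundary detection from per-timestep prediction losses.
# Assumption: a sharp rise over the recent baseline marks a skill boundary;
# the paper's exact rule and hyperparameters may differ.
import numpy as np

def detect_skill_boundaries(losses, window=32, rise=2.0, min_gap=64):
    """losses: per-timestep action-prediction losses from the unconditional policy."""
    losses = np.asarray(losses, dtype=np.float64)
    boundaries, last_cut = [], 0
    for t in range(window, len(losses)):
        baseline = losses[t - window:t].mean()
        # When the current skill no longer explains the observations, the
        # prediction loss spikes above the recent baseline.
        if losses[t] > rise * baseline and t - last_cut >= min_gap:
            boundaries.append(t)
            last_cut = t
    return boundaries

def split_segments(n_frames, boundaries):
    """Turn boundary indices into (start, end) segments covering the video."""
    cuts = [0] + list(boundaries) + [n_frames]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]
```

Enforcing a minimum gap between cuts is one simple way to avoid degenerate, very short segments.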

Stage III: We train a conditional Transformer-XL-based policy model on the segmented dataset to master a variety of atomic skills.
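
For concreteness, here is one way a skill condition could be injected into the same kind of causal backbone; the condition embedding (e.g., of a reference clip or text prompt) and all dimensions here are assumptions, not the actual GROOT or STEVE-1 architectures.

```python
# Illustrative conditional policy for Stage III, not the actual GROOT/STEVE-1 code.
# Assumption: each segment provides a condition embedding (from a video or text
# encoder) that is added to every timestep before the causal backbone.
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    def __init__(self, obs_dim=512, cond_dim=512, d_model=256, n_actions=128, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        self.cond_proj = nn.Linear(cond_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, obs_feats, cond):                 # cond: (B, cond_dim)
        h = self.embed(obs_feats) + self.cond_proj(cond).unsqueeze(1)
        T = obs_feats.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T).to(obs_feats.device)
        return self.action_head(self.backbone(h, mask=causal_mask))
```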

Stage IV: Finally, we use hierarchical methods (a combination of vision-language models and the conditional policy) to model long demonstrations and follow long-horizon instructions.
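
A high-level sketch of this hierarchy is shown below, using hypothetical planner, policy, and environment stubs with a Gym-style step interface; none of these names correspond to the actual Omnijarvis or JARVIS-1 APIs.

```python
# Illustrative Stage IV control loop with hypothetical planner/policy/env stubs.
# Assumptions: a Gym-style env (reset/step returning an info dict) and a planner
# that maps (instruction, observation) to the next atomic-skill prompt.
from typing import Any, Callable

def run_hierarchical_agent(env: Any,
                           planner: Callable[[str, Any], str],
                           skill_policy: Callable[[Any, str], Any],
                           instruction: str,
                           max_skill_steps: int = 600,
                           max_skills: int = 30) -> bool:
    obs = env.reset()
    for _ in range(max_skills):
        skill_prompt = planner(instruction, obs)        # e.g. "chop down a tree"
        for _ in range(max_skill_steps):
            action = skill_policy(obs, skill_prompt)    # conditional low-level policy
            obs, _, done, info = env.step(action)
            if done:
                return bool(info.get("success", False))
    return False
```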


Segmented Videos of Different Skills

Tool Use

Hunting and Combating

Collecting

Comparison of Segmentation Methods

*Green squares represent observations, and red circles represent actions.

Existing methods usually rely on additional rules, while our method is learning-based.

Sequential sampling methods divide videos into segments of predefined lengths (e.g., a fixed length or lengths drawn from a uniform distribution). However, these methods do not ensure that each segment contains a distinct skill, and the predefined lengths may not match the actual distribution of skill lengths in real-world scenarios.

Reward-driven methods discover skills through the environment’s reward signal. They are limited by their inability to capture skills that lack associated rewards and by the risk of splitting a single skill into multiple segments when rewards are gained repeatedly during its execution.

Top-down methods rely on predefined skill sets from human experts. They use manual labeling or supervised learning to segment videos. Although this approach can produce reasonable results, it is expensive and limited by the narrow range of predefined skills.

Bottom-up methods use algorithms such as agglomerative clustering or byte-pair encoding (BPE) to split action sequences. However, they struggle in partially observable settings, where both observations and actions must be considered, because these algorithms cannot handle high-dimensional observations.

Our method SBD is annotation-free, adapts to diverse skills, and is highly effective in open-world scenarios.
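
For contrast with SBD, the sketch below shows what the sequential sampling baselines described above might look like; both place cuts without ever looking at the video content, and the segment lengths are illustrative.

```python
# Illustrative sequential-sampling baselines (not from the paper's code).
# Both ignore video content when placing cuts, unlike SBD.
import random

def fixed_length_segments(n_frames, seg_len=128):
    return [(s, min(s + seg_len, n_frames)) for s in range(0, n_frames, seg_len)]

def uniform_length_segments(n_frames, min_len=64, max_len=256, seed=0):
    rng, segments, s = random.Random(seed), [], 0
    while s < n_frames:
        e = min(s + rng.randint(min_len, max_len), n_frames)
        segments.append((s, e))
        s = e
    return segments
```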

Experiment Results

Short-Horizon Atomic Tasks

We select basic skills such as "chop down trees" and advanced skills such as "smelt items with a furnace" in Minecraft as evaluation benchmarks, testing 12 different skill sets designed in MCU. Each task is tested 100 times, except for "sleep in bed" and "use bow", which are evaluated 10 times with human ratings. Scores with % denote success rates, while those without % denote rewards for the corresponding tasks. The Minecraft environment seeds are fixed per task so that different models can be compared fairly. The controllers showed substantial improvements across most tasks, with an average performance gain of 63.7% for GROOT and 52.1% for STEVE-1.
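
A sketch of the evaluation loop implied above, with hypothetical make_env/agent interfaces and a Gym-style step that reports success in its info dict; the real benchmark harness differs.

```python
# Illustrative evaluation protocol: fixed seeds per task, success rate over rollouts.
# make_env, agent.act, and the info["success"] flag are assumed interfaces.
def evaluate_task(make_env, agent, task, seeds, max_steps=1200):
    successes = 0
    for seed in seeds:                      # identical seeds reused for every model
        env = make_env(task, seed=seed)
        obs, success = env.reset(), False
        for _ in range(max_steps):
            obs, _, done, info = env.step(agent.act(obs, task))
            if done:
                success = bool(info.get("success", False))
                break
        successes += int(success)
    return successes / len(seeds)
```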


Long-horizon Programmatic Tasks

Programmatic tasks require the agent to start in a new world with an empty inventory and play until it obtains the final required items, e.g., “obtain an iron pickaxe from scratch”; each such task is usually a chain of atomic tasks. To further verify that the improved controllers also enhance hierarchical agents, we evaluate Omnijarvis and JARVIS-1 on groups of programmatic tasks selected from their original papers. All tasks are tested 30 times. In each group, the agent is required to obtain a certain type of item, either from scratch or starting with an iron pickaxe. For example, the diamond group includes diamond pickaxe, diamond sword, jukebox, etc. The agents show substantial improvements across most of the tasks, achieving an average performance gain of 11.3% for Omnijarvis and 20.8% for JARVIS-1.