JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Abstract

Achieving human-like planning and control with visual observations in an open world is a key milestone for more functional generalist agents. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following this post-training paradigm, we obtain the first VLA models in Minecraft that can follow human instructions on over 1,000 different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Additionally, we show that our approach outperforms traditional imitation learning-based VLA models in Minecraft. We have open-sourced the code, models, and datasets to foster further research.


Master thousands of skills in Minecraft.

JARVIS-VLA can execute human language instructions in open-world Minecraft. We illustrate executions of different instructions below.

Diverse Instructions

Smelt iron ingots from iron ores

Craft iron swords with iron ingots

Kill sheep with a sword

Mine large ferns

Collect torch

Cook popped chorus fruit in furnace

JARVIS-VLA can reflect on its own actions and regenerate corrected actions.

Reflections

Craft gray wool

Craft chiseled sandstone

More Results

Below we share some additional results of JARVIS-VLA on Minecraft.


Cook

Cooking food in a furnace

cook baked potato

cook chicken

cook beef

cook cod

Harvest

Harvesting plants

harvest grass

harvest melon

harvest sugar cane

harvest sunflower

Kill

Killing mobs

kill chicken

kill cow

Mine

Mining ores with a pickaxe

mine diamond ore

mine iron ore

mine coal ore

mine mossy stone bricks

Mining blocks with tools

chop acacia door

mine purple bed

mine sand with shovel

mine spruce logs

Combat

Combating monsters

combat skeleton

combat spider

combat zombie

combat zombie with pumpkin

Smelt

Smelting ores

smelt brick

smelt nether brick

smelt stone

smelt terracotta

Crafting Items

Armor crafting

craft diamond leggings

craft gold helmet

craft iron chestplate

craft leather boots

Tool crafting

craft diamond sword

craft gold hoe

craft iron shovel

craft wooden pickaxe

Craft in the inventory

craft diamond from diamond block

craft emerald from emerald block

craft melon seeds from melon piece

craft stone button from stone

Craft in the inventory with the recipe book

craft glowstone

craft stone pressure plate

Craft on the crafting table

craft acacia boat from acacia planks

craft oak slab

craft paper from sugar cane

craft spruce boat from spruce planks

Craft on the crafting table with the recipe book

craft ladder from sticks

craft dark oak fence gate from dark oak planks

craft oak trapdoor

craft warped stairs

Others

craft crafting table

craft dark prismarine stairs

craft diorite slab

craft torch

Long rollout

long-horizon rollout video

mine ores

survive at night

Failure case analysis

Failures

craft arrow

smelt charcoal

combat creeper

combat enderman

mine sand slab

craft iron axe

Datasets

JARVIS-VLA models can be post-trained on various vision-language datasets using a unified tokenizer, supporting diverse vision-language applications such as question answering, image captioning, video question answering, visual grounding with bounding boxes, and decision-making. More examples can be found in the Hugging Face datasets repository.
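As a rough illustration of how such a vision-language post-training mixture might be assembled, the sketch below loads and interleaves several subsets with the Hugging Face datasets library. The dataset names, splits, and column names are placeholders invented for this example, not the actual repository layout.

    # Minimal sketch of assembling a vision-language post-training mixture with the
    # Hugging Face `datasets` library. Dataset names, splits, and column names are
    # hypothetical placeholders, not the actual JARVIS-VLA repository schema; the
    # three subsets are assumed to share the same (image, question, answer) columns.
    from datasets import load_dataset, interleave_datasets

    qa = load_dataset("your-org/minecraft-vqa", split="train")                 # question answering
    captioning = load_dataset("your-org/minecraft-captioning", split="train")  # image captioning
    grounding = load_dataset("your-org/minecraft-grounding", split="train")    # visual grounding

    # Mix the subsets so each training batch draws from every task type.
    mixture = interleave_datasets(
        [qa, captioning, grounding],
        probabilities=[0.4, 0.3, 0.3],
        seed=42,
    )

    def to_chat(example):
        """Convert one record into a chat-style instruction/response pair that a
        VLM post-training pipeline could consume alongside the image."""
        return {
            "image": example["image"],
            "messages": [
                {"role": "user", "content": example["question"]},
                {"role": "assistant", "content": example["answer"]},
            ],
        }

    mixture = mixture.map(to_chat)

The interleaving probabilities and the chat-style record format are design choices for this sketch; any mixing ratio or conversation template supported by the downstream VLM trainer would work in the same way.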

Related Projects

Check out some of our related projects below!

OmniJARVIS OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
NeurIPS 2024

A hierarchical open-ended agent based on a latent space that can answer questions and follow instructions in open-world Minecraft. We use a self-supervised behavior tokenizer to encode action sequences and build Vision-Language-Action (VLA) models on top of pretrained vision-language models.


JARVIS-1 JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
T-PAMI 2024

This work introduces JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. JARVIS-1 is capable of completing over 200 different tasks and performs exceptionally well in short-horizon tasks.


DEPS Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
NeurIPS 2023

DEPS is an interactive planning approach based on Large Language Models (LLMs) for open-ended multi-task agents. It enables better error correction from feedback during long-horizon planning and introduces a sense of proximity via a goal Selector, a learnable module that ranks parallel sub-goals by their estimated steps to completion and refines the original plan accordingly.


BibTex


@article{li2025jarvisvla,
  title   = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse},
  author  = {Muyao Li and Zihao Wang and Kaichen He and Xiaojian Ma and Yitao Liang},
  journal = {arXiv preprint arXiv:2503.16365},
  year    = {2025}
}