# Online Training Module: General Information
Welcome to the MineStudio Online Training Module! This section provides a high-level overview of the module's architecture, core components, and underlying design philosophy. The online module trains agents directly within the interactive Minecraft environment, allowing them to learn and adapt through continuous experience.
## Core Philosophy: Learning by Doing, at Scale
The online training pipeline in MineStudio is built with scalability and efficiency in mind. It leverages the power of Ray for distributed computation, enabling you to train agents on complex tasks that may require significant computational resources and vast amounts of interaction data. The central idea is to have agents (policies) that learn by actively engaging with the environment, collecting experiences, and updating their decision-making processes in near real-time.
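MineStudio's online pipeline builds on Ray's actor model: long-lived remote workers that collect experience in parallel. The following toy example shows that pattern in isolation; the class and method names are illustrative stand-ins, not MineStudio's actual API.

```python
# A minimal sketch of the Ray actor pattern underlying online training.
# ExperienceCollector is a toy stand-in for a rollout worker.
import ray

ray.init()

@ray.remote
class ExperienceCollector:
    def __init__(self, worker_id: int):
        self.worker_id = worker_id

    def collect(self, num_steps: int) -> list:
        # A real worker would step Minecraft environments here; we just
        # return placeholder transitions.
        return [{"worker": self.worker_id, "step": t} for t in range(num_steps)]

# Launch four collectors and gather their experience in parallel.
collectors = [ExperienceCollector.remote(i) for i in range(4)]
fragments = ray.get([c.collect.remote(num_steps=8) for c in collectors])
print(f"collected {sum(len(f) for f in fragments)} transitions")
```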
## Architectural Overview: Key Components
The online training module is organized into four interconnected sub-modules, each residing in its own subfolder within `minestudio/online/`:

- `run`: The entry point for initiating and managing an online training session. It is responsible for parsing configurations, initializing the necessary Ray actors, and orchestrating the overall workflow; think of it as the conductor of the online training orchestra. For more details, see the Run documentation.
- `rollout`: Dedicated to the crucial task of experience collection. This component manages a fleet of workers that interact with multiple instances of the Minecraft environment in parallel. The workers use the current agent policy to decide actions, observe outcomes, and gather the raw data (observations, actions, rewards, etc.) that forms the basis of learning. For more details, see the Rollout documentation.
- `trainer`: Where the learning happens. The trainer takes the experiences collected by the `rollout` workers and uses them to optimize the agent's policy. MineStudio primarily features a `PPOTrainer`, implementing Proximal Policy Optimization (PPO), a robust and widely used reinforcement learning algorithm. For more details, see the Trainer documentation.
- `utils`: A collection of shared utilities, data structures, and helper functions that support both the `rollout` and `trainer` components, promoting code reuse and consistency. For more details, see the Utils documentation.
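Put together, the directory layout looks schematically like this (a sketch based on the components above, not an exhaustive file listing):

```
minestudio/online/
├── run/        # entry point: config parsing, Ray actor setup, orchestration
├── rollout/    # parallel experience collection (RolloutWorker, EnvWorker)
├── trainer/    # policy optimization (PPOTrainer)
└── utils/      # shared data structures and helpers
```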
## Interplay of Components: A Simplified Data Flow
While the detailed interactions are covered in the specific documentation for each component, here’s a simplified view of how they work together:
1. The `run` script starts the process, initializing the `RolloutManager` (from the `rollout` module) and the `Trainer` (e.g., `PPOTrainer`).
2. The `RolloutManager` deploys multiple `RolloutWorker` actors. Each `RolloutWorker` in turn manages several `EnvWorker` instances, which are the actual Minecraft environment simulations.
3. `EnvWorker`s send observations to their `RolloutWorker`.
4. The `RolloutWorker` uses its local copy of the current policy (periodically updated by the `Trainer`) to select actions for each of its `EnvWorker`s.
5. Actions are applied in the `EnvWorker`s, and the resulting new observations, rewards, and done states (collectively, a "step" of experience) are sent back to the `RolloutWorker`.
6. The `RolloutWorker` groups these steps into `SampleFragment`s (chunks of trajectory data).
7. These `SampleFragment`s are then sent, often via a `RolloutWorkerWrapper` and an internal queue, to a replay buffer (which can be part of the `RolloutManager` or a separate entity it manages).
8. The `Trainer` fetches batches of `SampleFragment`s from the replay buffer.
9. The `Trainer` computes advantages (e.g., using Generalized Advantage Estimation, GAE) and then performs optimization steps (e.g., PPO updates) to improve the policy and value function models.
10. Periodically, the `Trainer` sends the updated model weights to the `RolloutManager`, which broadcasts them to all `RolloutWorker`s, ensuring they use the latest policy for subsequent data collection.

This cycle of data collection and training continues, allowing the agent to progressively learn and improve its performance; a condensed sketch of the loop follows below.
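Condensed into code, the cycle above looks roughly like the following sketch. Every name here (`collect`, `fetch_batch`, `compute_gae`, `ppo_update`, `broadcast_weights`) is a hypothetical stand-in rather than the real MineStudio interface; see the Rollout and Trainer documentation for the actual APIs.

```python
# Illustrative pseudocode for the collect -> train -> broadcast cycle.
# All method names are hypothetical stand-ins, not MineStudio's API.

def online_training_loop(rollout_manager, trainer, replay_buffer,
                         num_iterations: int, broadcast_interval: int = 10):
    for iteration in range(num_iterations):
        # RolloutWorkers step their EnvWorkers with the current policy and
        # push SampleFragments into the replay buffer (steps 2-7 above).
        rollout_manager.collect(into=replay_buffer)

        # The trainer fetches a batch of fragments (step 8) ...
        batch = replay_buffer.fetch_batch()

        # ... computes advantages and performs PPO updates (step 9).
        advantages = trainer.compute_gae(batch)
        trainer.ppo_update(batch, advantages)

        # Updated weights flow back to the rollout side (step 10).
        if iteration % broadcast_interval == 0:
            rollout_manager.broadcast_weights(trainer.get_weights())
```

Step 9 refers to Generalized Advantage Estimation (GAE). As a self-contained illustration of what that computation does, here is a standard NumPy version, shown for exposition only and not taken from MineStudio's own implementation:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages and value targets for one trajectory fragment.

    rewards, values, dones are arrays of length T; last_value is the value
    estimate for the state after the fragment (0.0 if the episode ended),
    used to bootstrap the final TD error.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_value, next_advantage = last_value, 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]  # zero out bootstrapping at episode ends
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # GAE: exponentially weighted (by gamma * lam) sum of future TD errors.
        next_advantage = delta + gamma * lam * nonterminal * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    returns = advantages + values  # regression targets for the value function
    return advantages, returns

# Example: a 4-step fragment whose episode terminates at the last step.
adv, ret = compute_gae(
    rewards=np.array([0.0, 0.0, 0.0, 1.0]),
    values=np.array([0.5, 0.6, 0.7, 0.8]),
    dones=np.array([0.0, 0.0, 0.0, 1.0]),
    last_value=0.0,
)
print(adv, ret)
```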
## Getting Started
To dive deeper into specific aspects:
- Understand how to configure your training runs in the Config section.
- For a quick guide on launching a training session, refer to the Quick Start.
- If you're interested in extending or modifying the existing trainers or policies, the Customization page will be your guide.
This modular and distributed architecture is designed to be flexible and scalable, catering to a wide range of research and development needs in the exciting domain of learning agents for Minecraft.