# Online Training Module: General Information
Welcome to the MineStudio Online Training Module! This section of the documentation provides a high-level overview of the module's architecture, core components, and underlying design philosophy. The online module is engineered to facilitate the training of agents directly within the interactive Minecraft environment, allowing them to learn and adapt through continuous experience.
## Core Philosophy: Learning by Doing, at Scale
The online training pipeline in MineStudio is built with scalability and efficiency in mind. It leverages the power of Ray for distributed computation, enabling you to train agents on complex tasks that may require significant computational resources and vast amounts of interaction data. The central idea is to have agents (policies) that learn by actively engaging with the environment, collecting experiences, and updating their decision-making processes in near real-time.
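For a flavor of what this looks like in practice, the minimal sketch below uses Ray's actor API to run several experience-collecting workers in parallel. The `ToyRolloutWorker` class is a toy stand-in, not MineStudio's actual implementation; it only illustrates the distributed-actor pattern the online module builds on.

```python
import ray

ray.init()

@ray.remote
class ToyRolloutWorker:
    """Toy stand-in for a distributed experience-collection actor."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id

    def collect(self, num_steps: int) -> dict:
        # A real worker would step a Minecraft environment here and
        # return observations, actions, and rewards.
        return {"worker_id": self.worker_id, "steps_collected": num_steps}

# Launch several workers as Ray actors and gather their results in parallel.
workers = [ToyRolloutWorker.remote(i) for i in range(4)]
print(ray.get([w.collect.remote(128) for w in workers]))
```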
## Architectural Overview: Key Components
The online training module is primarily organized into four interconnected sub-modules, each residing in its respective subfolder within `minestudio/online/` (a directory sketch follows the list):
- `run`: This is the entry point for initiating and managing an online training session. It is responsible for parsing configurations, initializing the necessary Ray actors, and orchestrating the overall workflow. Think of it as the conductor of the online training orchestra. For more details, see the Run documentation.
- `rollout`: This component is dedicated to the crucial task of experience collection. It manages a fleet of workers that interact with multiple instances of the Minecraft environment in parallel. These workers use the current agent policy to decide actions, observe outcomes, and gather the raw data (observations, actions, rewards, etc.) that forms the basis of learning. For more details, see the Rollout documentation.
- `trainer`: This is where the learning happens. The trainer takes the experiences collected by the `rollout` workers and uses them to optimize the agent's policy. MineStudio primarily features a `PPOTrainer` (Proximal Policy Optimization), a robust and widely used reinforcement learning algorithm. For more details, see the Trainer documentation.
- `utils`: This directory houses a collection of shared utilities, data structures, and helper functions that support both the `rollout` and `trainer` components, promoting code reuse and consistency. For more details, see the Utils documentation.
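Putting these pieces side by side, the layout of `minestudio/online/` looks roughly like this (the comments summarize the roles described above; the individual files inside each folder may vary):

```
minestudio/online/
├── run/      # entry point: config parsing, Ray actor setup, orchestration
├── rollout/  # RolloutManager, RolloutWorker, EnvWorker: experience collection
├── trainer/  # PPOTrainer: policy and value-function optimization
└── utils/    # shared utilities and data structures used by rollout and trainer
```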
## Interplay of Components: A Simplified Data Flow
While the detailed interactions are covered in the specific documentation for each component, here’s a simplified view of how they work together:
1. The `run` script starts the process, initializing the `RolloutManager` (from the `rollout` module) and the `Trainer` (e.g., `PPOTrainer`).
2. The `RolloutManager` deploys multiple `RolloutWorker` actors. Each `RolloutWorker` in turn manages several `EnvWorker` instances, which are the actual Minecraft environment simulations. `EnvWorker`s send observations to their `RolloutWorker`.
3. The `RolloutWorker` uses its local copy of the current policy (periodically updated by the `Trainer`) to select actions for each of its `EnvWorker`s.
4. Actions are applied in the `EnvWorker`s, and the resulting new observations, rewards, and done states (collectively, a "step" of experience) are sent back to the `RolloutWorker`.
5. The `RolloutWorker` groups these steps into `SampleFragment`s (chunks of trajectory data).
6. These `SampleFragment`s are then sent, often via a `RolloutWorkerWrapper` and an internal queue, to a Replay Buffer (which can be part of the `RolloutManager` or a separate entity it manages).
7. The `Trainer` fetches batches of `SampleFragment`s from the Replay Buffer.
8. The `Trainer` computes advantages (e.g., using GAE) and then performs optimization steps (e.g., PPO updates) to improve the policy and value function models.
9. Periodically, the `Trainer` sends the updated model weights to the `RolloutManager`, which then broadcasts them to all `RolloutWorker`s, ensuring they use the latest policy for subsequent data collection.

This cycle of data collection and training continues, allowing the agent to progressively learn and improve its performance. A condensed sketch of this loop, together with a reference implementation of the GAE step, follows below.
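The following sketch condenses the nine steps above into a single training loop. It is purely illustrative: the method names (`collect_fragments`, `sample_batch`, `update`, `broadcast_weights`) are hypothetical stand-ins, not the actual MineStudio APIs, which are documented in the Rollout and Trainer sections.

```python
# Illustrative only: the method names below are hypothetical stand-ins for the
# real RolloutManager / PPOTrainer interfaces.

def online_training_loop(rollout_manager, trainer, num_iterations: int,
                         batch_size: int, sync_every: int = 10) -> None:
    for iteration in range(num_iterations):
        # Steps 1-6: RolloutWorkers step their EnvWorkers with the current
        # policy and push SampleFragments into the replay buffer.
        rollout_manager.collect_fragments()

        # Step 7: the Trainer fetches a batch of SampleFragments.
        batch = rollout_manager.sample_batch(batch_size)

        # Step 8: compute GAE advantages and run PPO optimization steps.
        metrics = trainer.update(batch)

        # Step 9: periodically broadcast fresh weights to all RolloutWorkers.
        if iteration % sync_every == 0:
            rollout_manager.broadcast_weights(trainer.get_policy_weights())

        print(f"iteration {iteration}: {metrics}")
```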
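Step 8 refers to Generalized Advantage Estimation (GAE). For intuition, here is a self-contained reference implementation of the standard GAE recursion; it is not necessarily how `PPOTrainer` computes advantages internally.

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Standard GAE over one trajectory fragment of length T.

    `last_value` bootstraps the value of the state after the final step.
    Returns per-step advantages and value-function targets (returns).
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)

    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values
    return advantages, returns
```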
## Getting Started
To dive deeper into specific aspects:
- Understand how to configure your training runs in the Config section.
- For a quick guide on launching a training session, refer to the Quick Start.
- If you're interested in extending or modifying the existing trainers or policies, the Customization page will be your guide.
This modular and distributed architecture is designed to be flexible and scalable, catering to a wide range of research and development needs in the exciting domain of learning agents for Minecraft.