# Rollout
This document describes the rollout process in the online training module, the critical phase in which the agent interacts with the environment to gather experience.
## Design Principle
The rollout module is engineered for efficient, large-scale experience collection, leveraging Ray for distributed execution. Its design revolves around a few key components and design philosophies.
At the most granular level, the `EnvWorker` (Environment Worker) is responsible for running an individual instance of the Minecraft simulation. Multiple `EnvWorker`s operate in parallel, each generating a stream of raw interactions.
Orchestrating these `EnvWorker`s is the `RolloutWorker`. This component manages a collection of environments and houses the current policy model. Its primary duties include spawning and overseeing its `EnvWorker`s, performing inference (using the policy model to decide actions based on observations from the `EnvWorker`s), stepping the environments forward by relaying these actions, and collecting the resulting data (observations, actions, rewards, and done flags). The `RolloutWorker` also reports metrics to an `EpisodeStatistics` actor. For flexibility, a `progress_handler` callback can be specified; it is invoked with step-wise data and is typically used to channel the collected experience toward a replay buffer.
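For concreteness, a `progress_handler` might look like the following minimal sketch. The exact signature and field names here are assumptions for illustration, not the module's actual interface:

```python
from collections import defaultdict

# Per-environment step buffers; in the real system these live inside the
# RolloutWorkerWrapper rather than at module scope.
step_buffers = defaultdict(list)

def progress_handler(env_id, obs, action, reward, terminated, truncated):
    """Hypothetical per-step callback invoked by a RolloutWorker.

    A typical handler appends the step to a per-environment buffer that
    is later cut into SampleFragments (see below).
    """
    step_buffers[env_id].append({
        "obs": obs,
        "action": action,
        "reward": reward,
        "terminated": terminated,
        "truncated": truncated,
    })
```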
To manage the lifecycle and distributed nature of `RolloutWorker`s, they are wrapped by `RolloutWorkerWrapper` Ray actors. Each `RolloutWorkerWrapper` is responsible for a single `RolloutWorker`; it handles model updates pushed from the trainer and manages communication with the replay buffer. A key function of the wrapper is to buffer the steps collected by its `RolloutWorker` and assemble them into `SampleFragment`s: coherent chunks of trajectory data.
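As a rough illustration, a `SampleFragment` can be pictured as a container of aligned per-step arrays. Every field below is an assumption for the sketch, not the class's actual definition:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SampleFragment:
    """Illustrative container for a contiguous chunk of trajectory data.

    The real class likely carries extra metadata (e.g. the model version
    that produced the actions); these fields are assumptions.
    """
    obs: np.ndarray       # [fragment_length, *obs_shape]
    actions: np.ndarray   # [fragment_length, *action_shape]
    rewards: np.ndarray   # [fragment_length]
    dones: np.ndarray     # [fragment_length] terminated-or-truncated flags
    next_obs: np.ndarray  # observation after the last step (for bootstrapping)
    env_id: int           # which environment produced this fragment
```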
Overseeing the entire distributed fleet of `RolloutWorkerWrapper`s is the `RolloutManager`, itself a Ray actor. This central coordinator manages the creation and supervision of all `RolloutWorkerWrapper`s. It also plays a crucial role in model synchronization: when the trainer produces an updated model, the new weights are sent to the `RolloutManager`, which efficiently distributes them to all its `RolloutWorkerWrapper`s; these wrappers, in turn, update their local `RolloutWorker` instances. The system also manages data staleness, ensuring that workers do not continue to generate data with overly outdated policies. The `RolloutManager` also interfaces with the `ReplayBuffer` (often via a `ReplayBufferInterface`) and the `EpisodeStatistics` actor.
The data flow is designed for parallelism and efficiency: `EnvWorker`s run environment instances and send observations to their `RolloutWorker`. The `RolloutWorker` performs inference, sends actions back, and collects step data. This data is passed to its `RolloutWorkerWrapper` via the `progress_handler`. The `RolloutWorkerWrapper` buffers these steps, assembles them into `SampleFragment`s, and then typically uses a separate internal queue and sender thread to push the fragments to the `ReplayBufferInterface`, where they are ready for the trainer.
## Logic
The rollout process unfolds through a sequence of coordinated actions, primarily orchestrated by Ray actors:
### 1. Initialization: Setting the Stage
The process begins with the creation of the `RolloutManager` Ray actor. This manager immediately takes on the task of spawning multiple `RolloutWorkerWrapper` Ray actors. Each of these wrappers, upon its own initialization, creates and configures a local `RolloutWorker` instance. Finally, each `RolloutWorker` launches several `EnvWorker` processes. Each `EnvWorker` is an independent simulation of the Minecraft environment, ready to interact with an agent.
> **Note:** This hierarchical setup (`RolloutManager` -> `RolloutWorkerWrapper` (actor) -> `RolloutWorker` -> `EnvWorker` (process)) allows for scalable and resilient distributed data collection. Ray's actor model handles the complexities of inter-process communication and scheduling across potentially many machines.
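A minimal sketch of this hierarchy in Ray might look like the following. All class interfaces here are illustrative stand-ins, not the module's actual API:

```python
import ray

class EnvWorker:
    """Stands in for one Minecraft simulation running in its own process."""
    def reset(self): ...
    def step(self, action): ...

class RolloutWorker:
    """Plain Python object owning the policy model and several EnvWorkers."""
    def __init__(self, num_envs, progress_handler):
        self.envs = [EnvWorker() for _ in range(num_envs)]
        self.progress_handler = progress_handler

@ray.remote
class RolloutWorkerWrapper:
    """Ray actor responsible for exactly one RolloutWorker."""
    def __init__(self, num_envs):
        self.worker = RolloutWorker(num_envs, self._on_step)

    def _on_step(self, env_id, obs, action, reward, terminated, truncated):
        ...  # buffer the step and cut SampleFragments (see section 3)

@ray.remote
class RolloutManager:
    """Central Ray actor supervising the fleet of wrappers."""
    def __init__(self, num_workers, envs_per_worker):
        self.wrappers = [
            RolloutWorkerWrapper.remote(envs_per_worker)
            for _ in range(num_workers)
        ]

# ray.init()
# manager = RolloutManager.remote(num_workers=8, envs_per_worker=16)
```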
### 2. The Experience Collection Loop: Agents in Action
The core of the rollout happens within each `RolloutWorker`, as a continuous loop: the `RolloutWorker` actively polls its associated `EnvWorker`s for new observations from the game. Once enough observations are gathered to form an inference batch (the size of which is configurable), the `RolloutWorker` uses its current policy model to perform inference, generating actions for each environment. These actions are then dispatched back to the respective `EnvWorker`s, which apply them to their Minecraft instances and step the simulation forward.
The resulting data from each step (the new observation, the action taken, the reward received, and whether the episode terminated or was truncated) is then passed by the `RolloutWorker` to its `progress_handler`. As mentioned, this handler is typically a method on the parent `RolloutWorkerWrapper`.
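Sketched in Python, the loop might be structured as follows. The `env.poll()` and `policy.infer()` calls, and the synchronous structure, are simplifying assumptions:

```python
def collection_loop(worker, inference_batch_size):
    """Illustrative RolloutWorker loop: poll -> batch -> infer -> step."""
    pending = {}  # env_id -> latest observation still awaiting an action
    while True:
        # 1. Poll every environment for a fresh observation.
        for env_id, env in enumerate(worker.envs):
            obs = env.poll()  # assumed non-blocking read; may return None
            if obs is not None:
                pending[env_id] = obs
        # 2. Once a full batch is available, run one inference pass.
        if len(pending) >= inference_batch_size:
            env_ids = list(pending)
            batch = [pending.pop(i) for i in env_ids]
            actions = worker.policy.infer(batch)  # one batched forward pass
            # 3. Dispatch actions; each EnvWorker steps its simulation.
            for env_id, action in zip(env_ids, actions):
                result = worker.envs[env_id].step(action)
                obs, reward, terminated, truncated = result  # assumed return
                # 4. Hand the step data to the wrapper's progress_handler.
                worker.progress_handler(env_id, obs, action, reward,
                                        terminated, truncated)
```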
### 3. Data Buffering and Fragment Creation: Packaging Experiences
Inside the `RolloutWorkerWrapper`, the `progress_handler` (called by the `RolloutWorker`) appends the incoming step data to an internal buffer, usually one per `EnvWorker` it manages. When the buffer for a particular environment accumulates enough steps (typically `fragment_length + 1`, where `fragment_length` is the desired size of a data chunk), a `SampleFragment` is created. This fragment encapsulates the first `fragment_length` steps, forming a coherent piece of trajectory.
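Building on the `SampleFragment` sketch above, the cut might look like the code below. Keeping the extra step so the observation after the fragment's last action is known (e.g. for bootstrapping value targets) is an assumption about why `fragment_length + 1` steps are buffered:

```python
import numpy as np

FRAGMENT_LENGTH = 64  # stand-in for the configured fragment_length

def maybe_cut_fragment(buffer, env_id):
    """Cut a SampleFragment once fragment_length + 1 steps are buffered.

    The fragment takes the first fragment_length steps; the remaining
    step stays in the buffer so the next fragment starts from a known
    observation (assumed rationale for the + 1).
    """
    if len(buffer) < FRAGMENT_LENGTH + 1:
        return None
    steps, rest = buffer[:FRAGMENT_LENGTH], buffer[FRAGMENT_LENGTH:]
    buffer[:] = rest  # carry the overlap step over to the next fragment
    return SampleFragment(
        obs=np.stack([s["obs"] for s in steps]),
        actions=np.stack([s["action"] for s in steps]),
        rewards=np.array([s["reward"] for s in steps]),
        dones=np.array([s["terminated"] or s["truncated"] for s in steps]),
        next_obs=rest[0]["obs"],
        env_id=env_id,
    )
```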
This newly minted `SampleFragment` is then placed into an internal queue within the `RolloutWorkerWrapper`.
> **Tip:** Using an internal queue and a separate sender thread within the `RolloutWorkerWrapper` decouples the data collection (which is often synchronous with the environment steps) from the data transmission to the replay buffer. This can improve throughput and prevent the environment simulation from stalling if the replay buffer is temporarily slow to ingest data.
A dedicated sender thread running within the `RolloutWorkerWrapper` continuously monitors this queue. It takes `SampleFragment`s from the queue and adds them to the `ReplayBufferInterface`, making them available to the training process.
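A minimal version of the queue-plus-sender-thread pattern could look like this; `replay_buffer.add()` is an assumed interface standing in for the `ReplayBufferInterface`:

```python
import queue
import threading

def start_sender(replay_buffer, maxsize=128):
    """Start a daemon thread that drains fragments into the replay buffer.

    The bounded queue applies backpressure instead of growing without
    limit when the buffer is slow to ingest; collection code simply
    calls put() and keeps stepping environments.
    """
    fragment_queue = queue.Queue(maxsize=maxsize)

    def sender_loop():
        while True:
            fragment = fragment_queue.get()  # blocks until work arrives
            replay_buffer.add(fragment)      # assumed ingestion call

    threading.Thread(target=sender_loop, daemon=True).start()
    return fragment_queue  # producers enqueue with fragment_queue.put(...)
```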
### 4. Model Updates: Staying Current
To ensure the agent learns from the most effective behaviors, its policy model must be kept up to date. After performing its optimization steps, the `Trainer` component periodically sends new model weights to the `RolloutManager`.
The `RolloutManager` then efficiently broadcasts these updated weights to all of its managed `RolloutWorkerWrapper` actors. Upon receiving new weights, each `RolloutWorkerWrapper` instructs its local `RolloutWorker` to load them into its policy model.
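One common way to make such a broadcast efficient in Ray is to put the weights into the object store once and fan out the reference, as in this sketch (`update_model` is a hypothetical actor method):

```python
import ray

def broadcast_weights(wrappers, weights, version):
    """Ship one copy of the weights and notify every wrapper.

    ray.put() serializes the (potentially large) weights into the object
    store once; passing the resulting ObjectRef as an argument lets each
    actor fetch the same copy instead of re-serializing per actor.
    """
    weights_ref = ray.put(weights)
    pending = [w.update_model.remote(weights_ref, version) for w in wrappers]
    ray.get(pending)  # optionally block until every wrapper has loaded them
```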
A `max_staleness` parameter is often used to manage data quality. If a `RolloutWorkerWrapper` finds that the data it is collecting was produced by a model version too far behind the latest version received, it may discard that data or reset its buffers to avoid polluting the replay buffer with experience from a significantly outdated policy.
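The check itself reduces to a version comparison. This sketch assumes each fragment records the model version that produced it:

```python
def is_too_stale(fragment_version, latest_version, max_staleness):
    """Hypothetical staleness test applied before a fragment is queued."""
    return latest_version - fragment_version > max_staleness

# Inside the wrapper, before enqueueing a finished fragment:
# if is_too_stale(fragment.model_version, self.latest_version, MAX_STALENESS):
#     return  # drop the fragment rather than pollute the replay buffer
```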
### 5. Monitoring Performance and Progress: Statistics
Keeping track of the rollout process is vital. Two main components handle this:
The `EpisodeStatistics` actor, a central Ray actor, is responsible for tracking and logging episode-level metrics. As `EnvWorker`s complete episodes, they report information such as total reward and episode length (via their `RolloutWorker`) to `EpisodeStatistics`. This provides a high-level view of agent performance.
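A stripped-down version of such an actor might look like this; the method names and tracked quantities are illustrative:

```python
import ray

@ray.remote
class EpisodeStatistics:
    """Illustrative central collector for episode-level metrics."""
    def __init__(self):
        self.returns = []
        self.lengths = []

    def report_episode(self, total_reward, length):
        # Called by rollout workers whenever an environment finishes
        # an episode.
        self.returns.append(total_reward)
        self.lengths.append(length)

    def summary(self, last_n=100):
        recent = self.returns[-last_n:]
        return {
            "episodes": len(self.returns),
            "mean_return": sum(recent) / max(len(recent), 1),
        }
```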
At a more granular level, the `PipelineMonitor` (often residing within each `RolloutWorker`) tracks performance metrics of the data-collection pipeline itself. These can include timings for different stages, such as the time spent receiving observations from environments, the duration of the inference step, and the time taken to send actions back. This information is invaluable for identifying bottlenecks and optimizing rollout throughput.
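One way to collect such stage timings is a small context-manager-based timer, sketched below; the real `PipelineMonitor`'s API may differ:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PipelineMonitor:
    """Illustrative stage timer for the rollout pipeline."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        # Accumulate wall-clock time for the named pipeline stage.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean_ms(self, name):
        return 1000.0 * self.totals[name] / max(self.counts[name], 1)

# Usage: with monitor.stage("inference"): actions = policy.infer(batch)
```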
In essence, the `RolloutManager` acts as the conductor of this distributed orchestra of workers and actors. It ensures that all components work in concert using up-to-date policies, and that the valuable experience generated by the agent's interactions is efficiently collected, packaged, and stored for learning.