# Rollout
This document describes the rollout process in the online training module, the critical phase in which the agent interacts with the environment to gather experience.
## Design Principle
The rollout module is engineered for efficient, large-scale experience collection, leveraging Ray for distributed execution. Its design revolves around a few key components and design philosophies.
At the most granular level, the `EnvWorker` (Environment Worker) is responsible for running an individual instance of the Minecraft simulation. Multiple `EnvWorker`s operate in parallel, each generating a stream of raw interactions.
Orchestrating these `EnvWorker`s is the `RolloutWorker`. This component manages a collection of environments and houses the current policy model. Its primary duties include spawning and overseeing its `EnvWorker`s, performing inference (using the policy model to decide actions based on observations from the `EnvWorker`s), stepping the environments forward by relaying these actions, and collecting the resulting data (observations, actions, rewards, and done flags). The `RolloutWorker` also reports metrics to an `EpisodeStatistics` actor. For flexibility, a `progress_handler` callback can be specified; it is invoked with step-wise data and is typically used to channel the collected experience toward a replay buffer.
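For concreteness, a `progress_handler` might look like the following minimal sketch. The exact signature and field names here are assumptions for illustration, not the module's actual interface:

```python
from collections import defaultdict

# Per-environment step buffers; in the real system these live inside the
# RolloutWorkerWrapper rather than at module scope.
step_buffers = defaultdict(list)

def progress_handler(env_id, obs, action, reward, terminated, truncated):
    """Hypothetical per-step callback invoked by a RolloutWorker.

    A typical handler appends the step to a per-environment buffer that
    is later cut into SampleFragments (see below).
    """
    step_buffers[env_id].append({
        "obs": obs,
        "action": action,
        "reward": reward,
        "terminated": terminated,
        "truncated": truncated,
    })
```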
To manage the lifecycle and distributed nature of `RolloutWorker`s, they are wrapped by `RolloutWorkerWrapper` Ray actors. Each `RolloutWorkerWrapper` is responsible for a single `RolloutWorker`; it handles model updates pushed from the trainer and manages communication with the replay buffer. A key function of the wrapper is to buffer the steps collected by its `RolloutWorker` and assemble them into `SampleFragment`s: coherent chunks of trajectory data.
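As a rough illustration, a `SampleFragment` can be pictured as a container of aligned per-step arrays. Every field below is an assumption for the sketch, not the class's actual definition:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SampleFragment:
    """Illustrative container for a contiguous chunk of trajectory data.

    The real class likely carries extra metadata (e.g. the model version
    that produced the actions); these fields are assumptions.
    """
    obs: np.ndarray       # [fragment_length, *obs_shape]
    actions: np.ndarray   # [fragment_length, *action_shape]
    rewards: np.ndarray   # [fragment_length]
    dones: np.ndarray     # [fragment_length] terminated-or-truncated flags
    next_obs: np.ndarray  # observation after the last step (for bootstrapping)
    env_id: int           # which environment produced this fragment
```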
Overseeing the entire distributed fleet of `RolloutWorkerWrapper`s is the `RolloutManager`, itself a Ray actor. This central coordinator manages the creation and supervision of all `RolloutWorkerWrapper`s. It also plays a crucial role in model synchronization: when the trainer produces an updated model, the new weights are sent to the `RolloutManager`, which efficiently distributes them to all its `RolloutWorkerWrapper`s; these wrappers, in turn, update their local `RolloutWorker` instances. The system also manages data staleness, ensuring that workers do not continue to generate data with overly outdated policies. The `RolloutManager` also interfaces with the `ReplayBuffer` (often via a `ReplayBufferInterface`) and the `EpisodeStatistics` actor.
The data flow is designed for parallelism and efficiency: `EnvWorker`s run environment instances and send observations to their `RolloutWorker`. The `RolloutWorker` performs inference, sends actions back, and collects step data. This data is passed to its `RolloutWorkerWrapper` via the `progress_handler`. The `RolloutWorkerWrapper` buffers these steps, assembles them into `SampleFragment`s, and then typically uses a separate internal queue and sender thread to push the fragments to the `ReplayBufferInterface`, where they are ready for the trainer.
## Logic
The rollout process unfolds through a sequence of coordinated actions, primarily orchestrated by Ray actors:
### 1. Initialization: Setting the Stage
The process begins with the creation of the `RolloutManager` Ray actor. This manager immediately takes on the task of spawning multiple `RolloutWorkerWrapper` Ray actors. Each of these wrappers, upon its own initialization, creates and configures a local `RolloutWorker` instance. Finally, each `RolloutWorker` launches several `EnvWorker` processes. Each `EnvWorker` is an independent simulation of the Minecraft environment, ready to interact with an agent.
> **Note:** This hierarchical setup (`RolloutManager` -> `RolloutWorkerWrapper` (actor) -> `RolloutWorker` -> `EnvWorker` (process)) allows for scalable and resilient distributed data collection. Ray's actor model handles the complexities of inter-process communication and scheduling across potentially many machines.
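A minimal sketch of this hierarchy in Ray might look like the following. All class interfaces here are illustrative stand-ins, not the module's actual API:

```python
import ray

class EnvWorker:
    """Stands in for one Minecraft simulation running in its own process."""
    def reset(self): ...
    def step(self, action): ...

class RolloutWorker:
    """Plain Python object owning the policy model and several EnvWorkers."""
    def __init__(self, num_envs, progress_handler):
        self.envs = [EnvWorker() for _ in range(num_envs)]
        self.progress_handler = progress_handler

@ray.remote
class RolloutWorkerWrapper:
    """Ray actor responsible for exactly one RolloutWorker."""
    def __init__(self, num_envs):
        self.worker = RolloutWorker(num_envs, self._on_step)

    def _on_step(self, env_id, obs, action, reward, terminated, truncated):
        ...  # buffer the step and cut SampleFragments (see section 3)

@ray.remote
class RolloutManager:
    """Central Ray actor supervising the fleet of wrappers."""
    def __init__(self, num_workers, envs_per_worker):
        self.wrappers = [
            RolloutWorkerWrapper.remote(envs_per_worker)
            for _ in range(num_workers)
        ]

# ray.init()
# manager = RolloutManager.remote(num_workers=8, envs_per_worker=16)
```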
### 2. The Experience Collection Loop: Agents in Action
The core of the rollout happens within each `RolloutWorker`, as a continuous loop: the `RolloutWorker` actively polls its associated `EnvWorker`s for new observations from the game. Once enough observations are gathered to form an inference batch (the size of which is configurable), the `RolloutWorker` uses its current policy model to perform inference, generating actions for each environment. These actions are then dispatched back to the respective `EnvWorker`s, which apply them to their Minecraft instances and step the simulation forward.
The resulting data from each step (the new observation, the action taken, the reward received, and whether the episode terminated or was truncated) is then passed by the `RolloutWorker` to its `progress_handler`. As mentioned, this handler is typically a method on the parent `RolloutWorkerWrapper`.
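Sketched in Python, the loop might be structured as follows. The `env.poll()` and `policy.infer()` calls, and the synchronous structure, are simplifying assumptions:

```python
def collection_loop(worker, inference_batch_size):
    """Illustrative RolloutWorker loop: poll -> batch -> infer -> step."""
    pending = {}  # env_id -> latest observation still awaiting an action
    while True:
        # 1. Poll every environment for a fresh observation.
        for env_id, env in enumerate(worker.envs):
            obs = env.poll()  # assumed non-blocking read; may return None
            if obs is not None:
                pending[env_id] = obs
        # 2. Once a full batch is available, run one inference pass.
        if len(pending) >= inference_batch_size:
            env_ids = list(pending)
            batch = [pending.pop(i) for i in env_ids]
            actions = worker.policy.infer(batch)  # one batched forward pass
            # 3. Dispatch actions; each EnvWorker steps its simulation.
            for env_id, action in zip(env_ids, actions):
                result = worker.envs[env_id].step(action)
                obs, reward, terminated, truncated = result  # assumed return
                # 4. Hand the step data to the wrapper's progress_handler.
                worker.progress_handler(env_id, obs, action, reward,
                                        terminated, truncated)
```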
### 3. Data Buffering and Fragment Creation: Packaging Experiences
Inside the `RolloutWorkerWrapper`, the `progress_handler` (called by the `RolloutWorker`) appends the incoming step data to an internal buffer, usually one per `EnvWorker` it manages. When the buffer for a particular environment accumulates enough steps (typically `fragment_length + 1`, where `fragment_length` is the desired size of a data chunk), a `SampleFragment` is created. This fragment encapsulates the first `fragment_length` steps, forming a coherent piece of trajectory.
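Building on the `SampleFragment` sketch above, the cut might look like the code below. Keeping the extra step so the observation after the fragment's last action is known (e.g. for bootstrapping value targets) is an assumption about why `fragment_length + 1` steps are buffered:

```python
import numpy as np

FRAGMENT_LENGTH = 64  # stand-in for the configured fragment_length

def maybe_cut_fragment(buffer, env_id):
    """Cut a SampleFragment once fragment_length + 1 steps are buffered.

    The fragment takes the first fragment_length steps; the remaining
    step stays in the buffer so the next fragment starts from a known
    observation (assumed rationale for the + 1).
    """
    if len(buffer) < FRAGMENT_LENGTH + 1:
        return None
    steps, rest = buffer[:FRAGMENT_LENGTH], buffer[FRAGMENT_LENGTH:]
    buffer[:] = rest  # carry the overlap step over to the next fragment
    return SampleFragment(
        obs=np.stack([s["obs"] for s in steps]),
        actions=np.stack([s["action"] for s in steps]),
        rewards=np.array([s["reward"] for s in steps]),
        dones=np.array([s["terminated"] or s["truncated"] for s in steps]),
        next_obs=rest[0]["obs"],
        env_id=env_id,
    )
```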
This newly minted `SampleFragment` is then placed into an internal queue within the `RolloutWorkerWrapper`.
> **Tip:** Using an internal queue and a separate sender thread within the `RolloutWorkerWrapper` decouples the data collection (which is often synchronous with the environment steps) from the data transmission to the replay buffer. This can improve throughput and prevent the environment simulation from stalling if the replay buffer is temporarily slow to ingest data.
A dedicated sender thread running within the `RolloutWorkerWrapper` continuously monitors this queue. It takes `SampleFragment`s from the queue and adds them to the `ReplayBufferInterface`, making them available to the training process.
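A minimal version of the queue-plus-sender-thread pattern could look like this; `replay_buffer.add()` is an assumed interface standing in for the `ReplayBufferInterface`:

```python
import queue
import threading

def start_sender(replay_buffer, maxsize=128):
    """Start a daemon thread that drains fragments into the replay buffer.

    The bounded queue applies backpressure instead of growing without
    limit when the buffer is slow to ingest; collection code simply
    calls put() and keeps stepping environments.
    """
    fragment_queue = queue.Queue(maxsize=maxsize)

    def sender_loop():
        while True:
            fragment = fragment_queue.get()  # blocks until work arrives
            replay_buffer.add(fragment)      # assumed ingestion call

    threading.Thread(target=sender_loop, daemon=True).start()
    return fragment_queue  # producers enqueue with fragment_queue.put(...)
```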
### 4. Model Updates: Staying Current
To ensure the agent learns from the most effective behaviors, its policy model must be kept up to date. After performing its optimization steps, the `Trainer` component periodically sends new model weights to the `RolloutManager`.
The `RolloutManager` then efficiently broadcasts these updated weights to all of its managed `RolloutWorkerWrapper` actors. Upon receiving new weights, each `RolloutWorkerWrapper` instructs its local `RolloutWorker` to load them into its policy model.
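One common way to make such a broadcast efficient in Ray is to put the weights into the object store once and fan out the reference, as in this sketch (`update_model` is a hypothetical actor method):

```python
import ray

def broadcast_weights(wrappers, weights, version):
    """Ship one copy of the weights and notify every wrapper.

    ray.put() serializes the (potentially large) weights into the object
    store once; passing the resulting ObjectRef as an argument lets each
    actor fetch the same copy instead of re-serializing per actor.
    """
    weights_ref = ray.put(weights)
    pending = [w.update_model.remote(weights_ref, version) for w in wrappers]
    ray.get(pending)  # optionally block until every wrapper has loaded them
```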
A `max_staleness` parameter is often used to manage data quality. If a `RolloutWorkerWrapper` finds that the data it is collecting was produced by a model version too far behind the latest version received, it may discard that data or reset its buffers to avoid polluting the replay buffer with experience from a significantly outdated policy.
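The check itself reduces to a version comparison. This sketch assumes each fragment records the model version that produced it:

```python
def is_too_stale(fragment_version, latest_version, max_staleness):
    """Hypothetical staleness test applied before a fragment is queued."""
    return latest_version - fragment_version > max_staleness

# Inside the wrapper, before enqueueing a finished fragment:
# if is_too_stale(fragment.model_version, self.latest_version, MAX_STALENESS):
#     return  # drop the fragment rather than pollute the replay buffer
```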
### 5. Monitoring Performance and Progress: Statistics
Keeping track of the rollout process is vital. Two main components handle this:
The `EpisodeStatistics` actor, a central Ray actor, is responsible for tracking and logging episode-level metrics. As `EnvWorker`s complete episodes, they report information such as total reward and episode length (via their `RolloutWorker`) to `EpisodeStatistics`. This provides a high-level view of agent performance.
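A stripped-down version of such an actor might look like this; the method names and tracked quantities are illustrative:

```python
import ray

@ray.remote
class EpisodeStatistics:
    """Illustrative central collector for episode-level metrics."""
    def __init__(self):
        self.returns = []
        self.lengths = []

    def report_episode(self, total_reward, length):
        # Called by rollout workers whenever an environment finishes
        # an episode.
        self.returns.append(total_reward)
        self.lengths.append(length)

    def summary(self, last_n=100):
        recent = self.returns[-last_n:]
        return {
            "episodes": len(self.returns),
            "mean_return": sum(recent) / max(len(recent), 1),
        }
```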
At a more granular level, the `PipelineMonitor` (often residing within each `RolloutWorker`) tracks performance metrics of the data-collection pipeline itself. These can include timings for different stages, such as the time spent receiving observations from environments, the duration of the inference step, and the time taken to send actions back. This information is invaluable for identifying bottlenecks and optimizing rollout throughput.
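One way to collect such stage timings is a small context-manager-based timer, sketched below; the real `PipelineMonitor`'s API may differ:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PipelineMonitor:
    """Illustrative stage timer for the rollout pipeline."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        # Accumulate wall-clock time for the named pipeline stage.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean_ms(self, name):
        return 1000.0 * self.totals[name] / max(self.counts[name], 1)

# Usage: with monitor.stage("inference"): actions = policy.infer(batch)
```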
In essence, the `RolloutManager` acts as the conductor of this distributed orchestra of workers and actors. It ensures that all components work in concert using up-to-date policies, and that the valuable experience generated by the agent's interactions is efficiently collected, packaged, and stored for learning.