ppo

Proximal Policy Optimization (PPO)

Reference: [1] https://arxiv.org/abs/1707.06347

Classes

PPOAgent

Proximal Policy Optimization (PPO)

PPOConfig

Configuration for the PPO agent.

PPOPolicy

PPOPolicy is a policy that combines an actor and a critic network.

class prt_rl.model_free.ppo.PPOAgent(policy: PPOPolicy, config: PPOConfig = PPOConfig(steps_per_batch=2048, mini_batch_size=32, learning_rate=0.0003, gamma=0.99, epsilon=0.1, gae_lambda=0.95, entropy_coef=0.01, value_coef=0.5, num_optim_steps=10, normalize_advantages=False), *, device: str = 'cpu')

Proximal Policy Optimization (PPO)

Parameters:

policy (PPOPolicy) – Policy to use.
config (PPOConfig) – Configuration for the PPO agent.
device (str) – Device to run the computations on (‘cpu’ or ‘cuda’).

act(obs: Tensor, deterministic: bool = False) → Tensor

Select an action based on the current observation.

Parameters:

obs (torch.Tensor) – The current observation from the environment.
deterministic (bool) – If True, the agent will select actions deterministically.

Returns:

The action to be taken.

Return type:

torch.Tensor
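The effect of the deterministic flag can be illustrated for a discrete action space. The sketch below is plain Python and does not use the library: select_action is a hypothetical helper, and it assumes that deterministic selection means taking the most probable action while stochastic selection samples from the softmax distribution. The agent's actual behaviour may differ.

```python
import math
import random


def select_action(logits, deterministic=False, rng=None):
    """Pick an action index from unnormalized logits (hypothetical helper).

    Returns the chosen index and its log-probability under the softmax
    distribution defined by the logits.
    """
    rng = rng or random
    # Numerically stable softmax: subtract the maximum before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    if deterministic:
        # Assumption: deterministic mode picks the most probable action.
        action = max(range(len(probs)), key=probs.__getitem__)
    else:
        # Stochastic mode samples an index in proportion to its probability.
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]

    return action, math.log(probs[action])
```

Deterministic selection is usually reserved for evaluation; during training, sampling keeps the policy exploring.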

classmethod load(path: str | Path, map_location: str | device = 'cpu') → PPOAgent

Loads the checkpoint and returns a fully-constructed PPOAgent.

train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] | None = None, logger: Logger | None = None, evaluator: Evaluator | None = None, show_progress: bool = True) → None

Train the PPO agent.

Parameters:

env (EnvironmentInterface) – The environment to train on.
total_steps (int) – Total number of steps to train for.
schedulers (Optional[List[ParameterScheduler]]) – Learning rate schedulers.
logger (Optional[Logger]) – Logger for training metrics.
evaluator (Optional[Evaluator]) – Evaluator for performance evaluation.
show_progress (bool) – If True, show a progress bar during training.
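As a rough guide to how total_steps relates to the PPOConfig defaults, the following sketch estimates the number of collected batches and gradient updates. It rests on two assumptions that this page does not confirm: that training collects steps_per_batch transitions per update, and that each of the num_optim_steps optimization steps is one full pass over the batch in mini-batches of mini_batch_size. update_budget is a hypothetical helper, not part of the library.

```python
import math


def update_budget(total_steps, steps_per_batch=2048, mini_batch_size=32,
                  num_optim_steps=10):
    """Estimate batch and gradient-update counts (hypothetical helper)."""
    # Assumption: one batch of steps_per_batch transitions per policy update.
    num_batches = math.ceil(total_steps / steps_per_batch)
    # Assumption: each optimization step visits every mini-batch once.
    mini_batches_per_pass = math.ceil(steps_per_batch / mini_batch_size)
    gradient_updates = num_batches * num_optim_steps * mini_batches_per_pass
    return {
        "num_batches": num_batches,
        "mini_batches_per_pass": mini_batches_per_pass,
        "gradient_updates": gradient_updates,
    }
```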

class prt_rl.model_free.ppo.PPOConfig(steps_per_batch: int = 2048, mini_batch_size: int = 32, learning_rate: float = 0.0003, gamma: float = 0.99, epsilon: float = 0.1, gae_lambda: float = 0.95, entropy_coef: float = 0.01, value_coef: float = 0.5, num_optim_steps: int = 10, normalize_advantages: bool = False)

Configuration for the PPO agent.

Parameters:

steps_per_batch (int) – Number of steps to collect per batch.
mini_batch_size (int) – Size of mini-batches for optimization.
learning_rate (float) – Learning rate for the optimizer.
gamma (float) – Discount factor for future rewards.
epsilon (float) – Clipping parameter for PPO.
gae_lambda (float) – Lambda parameter for Generalized Advantage Estimation.
entropy_coef (float) – Coefficient for the entropy term in the loss function.
value_coef (float) – Coefficient for the value loss term in the loss function.
num_optim_steps (int) – Number of optimization steps per batch.
normalize_advantages (bool) – Whether to normalize advantages.
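The roles of gamma, gae_lambda, and normalize_advantages can be illustrated with a minimal sketch of Generalized Advantage Estimation. It is plain Python rather than the library's implementation: compute_gae and normalize are hypothetical helpers, and the handling of episode ends via a dones flag is an assumption.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Compute GAE advantages for one trajectory (hypothetical helper).

    rewards, values, dones are equal-length sequences; last_value is the
    value estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


def normalize(advantages, eps=1e-8):
    """Shift and scale advantages to zero mean and unit variance."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]
```

With gae_lambda = 1 the estimate reduces to the discounted return minus the value baseline; smaller values trade variance for bias.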

class prt_rl.model_free.ppo.PPOPolicy(*, network: Module, actor_head: DistributionHead, critic_head: ValueHead, critic_network: Module | None = None)

PPOPolicy is a policy that combines an actor and a critic network. It can optionally use an encoder network to process the input state before passing it to the actor and critic heads.

The PPOPolicy is a combination of a DistributionPolicy for the actor and a ValueCritic for the critic. It can handle both discrete and continuous action spaces.

The architecture of the policy is as follows:

Encoder Network (optional): Processes the input state.
Actor Head: Computes actions based on the latent state.
Critic Head: Computes the value for the given state.

Parameters:

network (Module) – Network that processes the input state before it is passed to the heads.
actor_head (DistributionHead) – Head that computes the action distribution from the latent state.
critic_head (ValueHead) – Head that computes the value for the given state.
critic_network (Module | None) – Optional separate network for the critic. Default is None.


act(obs: Tensor, deterministic: bool = False) → Tuple[Tensor, Dict[str, Tensor]]

Returns the action together with an info dict.

Info dict keys (typical):

“log_prob”: (B,1)
“value”: (B,1)

evaluate_actions(obs: Tensor, action: Tensor) → Tuple[Tensor, Tensor, Tensor]

Used during PPO optimization.

Returns:

value: (B,1)
log_prob: (B,1)
entropy: (B,1)

Return type:

Tuple[Tensor, Tensor, Tensor]
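The value, log_prob, and entropy outputs are the quantities a PPO update typically combines with the epsilon, value_coef, and entropy_coef settings. The sketch below shows one common form of that loss in plain Python. It is not the library's implementation: ppo_loss is a hypothetical helper, and the mean-squared-error value loss without value clipping is an assumption.

```python
import math


def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropies, epsilon=0.1, value_coef=0.5, entropy_coef=0.01):
    """Clipped-surrogate PPO loss over one mini-batch (hypothetical helper)."""
    n = len(advantages)
    surrogate = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the rollout policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Pessimistic bound: take the smaller of the two objectives.
        surrogate += min(ratio * adv, clipped * adv)
    policy_loss = -surrogate / n
    # Assumption: plain mean-squared error between values and return targets.
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    entropy = sum(entropies) / n
    # The entropy bonus is subtracted so that minimizing the loss favours exploration.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

Clipping removes the incentive to move the ratio outside [1 - epsilon, 1 + epsilon], which keeps each update close to the policy that collected the data.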

forward(obs: Tensor, deterministic: bool = False) → Tensor

Convenience method that lets the policy be used like a normal nn.Module that outputs actions. Collectors should call act() instead to get the info dict.

get_state_value(obs: Tensor) → Tensor

Returns the state value for the given observation.

Parameters:

obs (torch.Tensor) – Observation batch of shape (B, obs_dim).

Returns:

value: (B,1)

Return type:

torch.Tensor

metadata() → Dict[str, Any]

Optionally save metadata alongside the policy. This is a no-op in the base class but can be overridden by subclasses.
