a2c

Implementation of the Advantage Actor-Critic (A2C) algorithm.

Classes

A2CAgent

Advantage Actor-Critic (A2C) agent implementation.

A2CConfig

Configuration parameters for the A2C agent.

A2CPolicy

Actor-critic policy module that combines a network with an actor head and a critic head.

class prt_rl.model_free.a2c.A2CAgent(policy: A2CPolicy, config: A2CConfig = A2CConfig(steps_per_batch=2048, mini_batch_size=32, learning_rate=0.0003, gamma=0.99, gae_lambda=0.95, entropy_coef=0.01, value_coef=0.5, max_grad_norm=0.5, normalize_advantages=False), *, device: str = 'cpu')

Advantage Actor-Critic (A2C) agent implementation.

Parameters:
  • policy (A2CPolicy) – The policy network used by the agent.

  • config (A2CConfig) – Configuration parameters for the A2C agent.

  • device (str) – The device to run the agent on (‘cpu’ or ‘cuda’).

act(obs: Tensor, deterministic: bool = False) → Tensor

Select an action for the current observation.

Parameters:
  • obs (torch.Tensor) – The current observation from the environment.

  • deterministic (bool) – If True, the agent will select actions deterministically.

Returns:

The action to be taken.

Return type:

torch.Tensor
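The effect of the deterministic flag can be illustrated with a small, library-independent sketch. It assumes a discrete (categorical) action distribution and assumes that deterministic selection means taking the most probable action; the helper select_action is hypothetical and is not part of prt_rl, whose actual behavior depends on the policy's distribution head.

```python
import random
from typing import Optional, Sequence


def select_action(
    probs: Sequence[float],
    deterministic: bool = False,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick an action index from categorical probabilities.

    Hypothetical helper for illustration only; not a prt_rl function.
    """
    indices = range(len(probs))
    if deterministic:
        # Assumption: deterministic selection returns the mode of the distribution.
        return max(indices, key=lambda i: probs[i])
    # Stochastic selection: sample an index in proportion to its probability.
    rng = rng or random.Random()
    return rng.choices(indices, weights=probs, k=1)[0]


# Deterministic selection always returns the most probable index.
greedy = select_action([0.1, 0.7, 0.2], deterministic=True)
# Stochastic selection may return any index with non-zero probability.
sampled = select_action([0.1, 0.7, 0.2], rng=random.Random(0))
```

Stochastic selection is typically what drives exploration during training, while deterministic selection is a common choice for evaluation.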

classmethod load(path: str | Path, map_location: str | device = 'cpu') → A2CAgent

Load the checkpoint at path and return a fully constructed A2CAgent.

train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] | None = None, logger: Logger | None = None, evaluator: Evaluator | None = None, show_progress: bool = True) → None

Train the A2C agent.

Parameters:
  • env (EnvironmentInterface) – The environment to train on.

  • total_steps (int) – Total number of steps to train for.

  • schedulers (Optional[List[ParameterScheduler]]) – Parameter schedulers applied during training (for example, for the learning rate).

  • logger (Optional[Logger]) – Logger for training metrics.

  • evaluator (Optional[Evaluator]) – Evaluator for performance evaluation.

  • show_progress (bool) – If True, show a progress bar during training.

class prt_rl.model_free.a2c.A2CConfig(steps_per_batch: int = 2048, mini_batch_size: int = 32, learning_rate: float = 0.0003, gamma: float = 0.99, gae_lambda: float = 0.95, entropy_coef: float = 0.01, value_coef: float = 0.5, max_grad_norm: float = 0.5, normalize_advantages: bool = False)

Configuration parameters for the A2C agent.

Variables:
  • steps_per_batch (int) – Number of steps to collect per training batch.

  • mini_batch_size (int) – Size of each mini-batch for optimization.

  • learning_rate (float) – Learning rate for the optimizer.

  • gamma (float) – Discount factor for future rewards.

  • gae_lambda (float) – Lambda parameter for Generalized Advantage Estimation.

  • entropy_coef (float) – Coefficient for the entropy bonus.

  • value_coef (float) – Coefficient for the value loss.

  • max_grad_norm (float) – Maximum gradient norm used for gradient clipping.

  • normalize_advantages (bool) – Whether to normalize advantages.
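To make the roles of gamma, gae_lambda, and normalize_advantages concrete, here is a minimal sketch of Generalized Advantage Estimation over a single rollout, written in plain Python. The functions compute_gae and normalize are illustrative names, not prt_rl functions, and the library's actual implementation (tensor shapes, handling of episode boundaries) may differ.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Compute GAE advantages for one rollout (illustrative sketch)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # value estimate for the state after the last step
    for t in reversed(range(len(rewards))):
        # Do not bootstrap across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


def normalize(advantages: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Shift and scale advantages to zero mean and unit standard deviation."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    return [(a - mean) / (var ** 0.5 + eps) for a in advantages]


adv = compute_gae(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0],
    dones=[False, True],
    last_value=5.0,
    gamma=0.5,
    gae_lambda=1.0,
)
adv_normalized = normalize(adv)
```

A gae_lambda close to 1 weights long-horizon returns more heavily (lower bias, higher variance), while a value close to 0 reduces the estimate to the one-step TD error.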

class prt_rl.model_free.a2c.A2CPolicy(network: Module, actor_head: DistributionHead, critic_head: ValueHead)

Actor-critic policy module that combines a network (network) with a distribution head for the actor (actor_head) and a value head for the critic (critic_head).

act(obs: Tensor, deterministic: bool = False) → Tuple[Tensor, Dict[str, Tensor]]

Returns the action and an info dict.

Info dict keys (typical):
  • “log_prob”: (B,1)

  • “value”: (B,1)

evaluate_actions(obs: Tensor, action: Tensor) → Tuple[Tensor, Tensor, Tensor]

Used during A2C optimization.

Returns:

A tuple (value, log_prob, entropy), each of shape (B,1).

Return type:

Tuple[Tensor, Tensor, Tensor]
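As a rough illustration of how the three outputs of evaluate_actions (value, log-probability, entropy) are commonly combined during A2C optimization, the sketch below computes a standard A2C objective in plain Python. The function a2c_loss is a hypothetical name, and the exact loss used by A2CAgent is an assumption here; only the coefficients value_coef and entropy_coef come from A2CConfig.

```python
from typing import Sequence, Tuple


def a2c_loss(
    log_probs: Sequence[float],
    values: Sequence[float],
    entropies: Sequence[float],
    advantages: Sequence[float],
    returns: Sequence[float],
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
) -> Tuple[float, float, float, float]:
    """Combine policy, value, and entropy terms (illustrative sketch)."""
    n = len(log_probs)
    # Policy gradient term: raise the log-probability of advantageous actions.
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    # Critic term: mean squared error between value estimates and returns.
    value_loss = sum((ret - v) ** 2 for ret, v in zip(returns, values)) / n
    # Entropy bonus: subtracted so that higher entropy lowers the total loss.
    entropy = sum(entropies) / n
    total = policy_loss + value_coef * value_loss - entropy_coef * entropy
    return total, policy_loss, value_loss, entropy


total, policy_loss, value_loss, entropy = a2c_loss(
    log_probs=[-1.0, -2.0],
    values=[0.0, 1.0],
    entropies=[0.5, 0.5],
    advantages=[1.0, -1.0],
    returns=[1.0, 1.0],
)
```

In a tensor-based implementation the same three terms would be computed over a mini-batch, and the gradient of the total would typically be clipped to max_grad_norm before the optimizer step.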

forward(obs: Tensor, deterministic: bool = False) → Tensor

Convenience method that lets the policy be used like a regular nn.Module that outputs actions. Collectors should call act() instead, which also returns the info dict.

metadata() → Dict[str, Any]

Optionally save metadata alongside the policy. This is a no-op in the base class but can be overridden by subclasses.