policy_gradient

Policy Gradient algorithm

Example Usage:

This example demonstrates how to initialize a Policy Gradient agent with a custom policy.

Classes

  • PolicyGradientAgent – Policy Gradient agent with step-wise optimization.

  • PolicyGradientConfig – Hyperparameter configuration for the Policy Gradient agent.

  • PolicyGradientPolicy – Base class for Policy Gradient policies.

class prt_rl.model_free.policy_gradient.PolicyGradientAgent(policy: PolicyGradientPolicy, config: PolicyGradientConfig = PolicyGradientConfig(batch_size=100, learning_rate=0.001, gamma=0.99, gae_lambda=0.95, optim_steps=1, use_reward_to_go=False, use_gae=False, baseline_learning_rate=0.005, baseline_optim_steps=5, normalize_advantages=True), *, device: str = 'cpu')

Policy Gradient agent with step-wise optimization.

Example

from prt_rl.model_free.policy_gradient import PolicyGradientAgent, PolicyGradientConfig
from prt_rl.common.policies import DistributionPolicy

# Set up the environment
# env = ...

# Configure the algorithm hyperparameters
config = PolicyGradientConfig(
    batch_size=1000,
    learning_rate=5e-3,
    gamma=1.0,
    use_reward_to_go=True,
    normalize_advantages=True,
)

# Configure the Policy Gradient policy
policy = DistributionPolicy(env_params=env.get_parameters())

# Create the agent
agent = PolicyGradientAgent(policy=policy, config=config)

# Train the agent; num_iterations is an illustrative value
num_iterations = 100
agent.train(env=env, total_steps=num_iterations * config.batch_size)

Parameters:
  • config (PolicyGradientConfig) – Configuration for the Policy Gradient agent.

  • policy (PolicyGradientPolicy) – The policy to be used by the agent.

  • device (str) – Device to run the agent on (e.g., ‘cpu’ or ‘cuda’). Default is ‘cpu’.

act(obs: Tensor, deterministic: bool = False) → Tensor

Selects an action for the current observation.

Parameters:
  • obs (torch.Tensor) – The current observation from the environment.

  • deterministic (bool) – If True, the agent will select actions deterministically.

Returns:

The action to be taken.

Return type:

torch.Tensor
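
A deterministic flag of this kind is commonly understood as switching between sampling from the policy's action distribution and taking its most likely action. The sketch below illustrates that idea for a discrete distribution in plain Python. The helper select_action and its arguments are hypothetical names, not part of prt_rl, and the library's actual behaviour may differ.

```python
import random


def select_action(probs, deterministic=False, rng=None):
    """Pick an action index from a list of action probabilities (hypothetical helper)."""
    if deterministic:
        # Greedy choice: index of the largest probability.
        return max(range(len(probs)), key=lambda i: probs[i])
    if rng is None:
        rng = random.Random()
    # Stochastic choice: sample an index in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.7, 0.2]
greedy = select_action(probs, deterministic=True)      # always the same index
sampled = select_action(probs, rng=random.Random(0))   # varies with the RNG state
```

Deterministic selection is typically used for evaluation, while sampling keeps exploration during training.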

classmethod load(path: str | Path, map_location: str | device = 'cpu') → PolicyGradientAgent

Loads a checkpoint from path and returns a fully constructed PolicyGradientAgent.

train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] = [], logger: Logger | None = None, evaluator: Evaluator | None = None, show_progress: bool = True) → None

Trains the Policy Gradient agent in the provided environment.

Parameters:
  • env (EnvironmentInterface) – The environment in which the agent will operate.

  • total_steps (int) – Total number of training steps to perform.

  • schedulers (List[ParameterScheduler]) – List of parameter schedulers to update during training.

  • logger (Optional[Logger]) – Logger for logging training progress. If None, a default logger will be created.

  • evaluator (Optional[Evaluator]) – Evaluator used to evaluate the agent periodically. Default is None.

  • show_progress (bool) – If True, show a progress bar during training.
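
The class example passes total_steps=num_iterations * config.batch_size, which suggests that each training iteration consumes one batch of environment steps. Under that assumption, which this page does not confirm, the number of policy updates in a run can be estimated as follows. The helper estimate_updates is hypothetical and not part of prt_rl.

```python
def estimate_updates(total_steps: int, batch_size: int, optim_steps: int = 1) -> dict:
    """Estimate iteration and gradient-step counts for a training run.

    Assumes one iteration collects `batch_size` environment steps and then
    performs `optim_steps` policy optimization steps.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterations = total_steps // batch_size  # only full batches are counted
    return {
        "iterations": iterations,
        "policy_updates": iterations * optim_steps,
        "leftover_steps": total_steps % batch_size,
    }


plan = estimate_updates(total_steps=100_000, batch_size=1000, optim_steps=1)
```

Choosing total_steps as a multiple of batch_size avoids a partial final batch under this reading.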

class prt_rl.model_free.policy_gradient.PolicyGradientConfig(batch_size: int = 100, learning_rate: float = 0.001, gamma: float = 0.99, gae_lambda: float = 0.95, optim_steps: int = 1, use_reward_to_go: bool = False, use_gae: bool = False, baseline_learning_rate: float = 0.005, baseline_optim_steps: int = 5, normalize_advantages: bool = True)

Hyperparameter Configuration for the Policy Gradient agent.

Parameters:
  • batch_size (int) – Size of the batch for training. Default is 100.

  • learning_rate (float) – Learning rate for the optimizer. Default is 1e-3.

  • gamma (float) – Discount factor for future rewards. Default is 0.99.

  • gae_lambda (float) – Lambda parameter for Generalized Advantage Estimation. Default is 0.95.

  • optim_steps (int) – Number of optimization steps per training iteration. Default is 1.

  • use_reward_to_go (bool) – Whether to use rewards-to-go instead of total discounted return. Default is False.

  • use_gae (bool) – Whether to use Generalized Advantage Estimation. Default is False.

  • baseline_learning_rate (float) – Learning rate for the baseline network if used. Default is 5e-3.

  • baseline_optim_steps (int) – Number of optimization steps for the baseline network. Default is 5.

  • normalize_advantages (bool) – Whether to normalize advantages before training. Default is True.
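
To make the difference between the total discounted return and rewards-to-go concrete, and to show what advantage normalization does, here is a minimal plain-Python sketch of the standard formulas. The function names are illustrative, and prt_rl's internal implementation may differ in details such as batching or the epsilon used.

```python
import math


def discounted_return(rewards, gamma):
    """Total discounted return of one episode: sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def rewards_to_go(rewards, gamma):
    """Discounted sum of rewards from each time step onward."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (math.sqrt(var) + eps) for v in values]


rewards = [1.0, 1.0, 1.0]
total = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
rtg = rewards_to_go(rewards, gamma=0.5)        # [1.75, 1.5, 1.0]
adv = normalize(rtg)                           # zero-mean weights
```

With the total return, every action in the episode is weighted by the same number. With rewards-to-go, each action is weighted only by the rewards that follow it, which usually lowers the variance of the gradient estimate.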

class prt_rl.model_free.policy_gradient.PolicyGradientPolicy(network: Module, actor_head: DistributionHead, critic_head: ValueHead | None = None, *, critic_network: Module | None = None)

Base class for Policy Gradient policies. This class can be extended to create custom policies for the Policy Gradient agent. The policy should output a distribution over actions given the current state.

Parameters:
  • network (nn.Module) – The neural network that processes the input state and outputs a latent representation.

  • actor_head (DistributionHead) – The head that takes the latent representation from the network and outputs a distribution over actions.

  • critic_head (Optional[ValueHead]) – The head that takes the latent representation from the network and outputs a value estimate for the state. Default is None.

  • critic_network (Optional[nn.Module]) – Optional network for the critic, passed by keyword. Default is None.


act(obs: Tensor, deterministic: bool = False) → Tuple[Tensor, Dict[str, Tensor]]

Returns the action and an info dict.

Info dict keys (typical):
  • “log_prob”: (B,1)

  • “value”: (B,1)
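
The (B,1) shapes above mean one log-probability and one value per batch element, each kept in its own trailing dimension. The following plain-Python sketch mimics that layout with nested lists for a discrete action space. The helpers log_softmax and pack_info are hypothetical illustrations, not prt_rl functions; the real method returns torch tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def pack_info(batch_logits, actions, values):
    """Build an info dict whose entries are nested lists of shape (B, 1)."""
    log_probs = [
        [log_softmax(logits)[a]]  # log-probability of the chosen action only
        for logits, a in zip(batch_logits, actions)
    ]
    return {"log_prob": log_probs, "value": [[v] for v in values]}


# Batch of B=2 observations, two discrete actions each.
info = pack_info([[0.0, 2.0], [1.0, 1.0]], actions=[1, 0], values=[0.3, -0.1])
```

Keeping the trailing dimension of size 1 lets log-probabilities, values, and rewards line up element-wise without implicit broadcasting.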

get_state_value(obs: Tensor) → Tensor

Returns the state value for the given observation.

Parameters:
  • obs (torch.Tensor) – Observations with shape (B, obs_dim).

Returns:

State value estimates with shape (B, 1).

Return type:

torch.Tensor
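
State values such as these are the input to Generalized Advantage Estimation, which the configuration exposes through use_gae and gae_lambda. As a hedged illustration of the standard GAE recursion, and not necessarily how prt_rl computes it, the sketch below works on plain Python lists for a single trajectory. The function gae_advantages is a hypothetical name.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation for one trajectory segment.

    rewards:    r_0 .. r_{T-1}
    values:     V(s_0) .. V(s_{T-1})
    last_value: V(s_T); use 0.0 if s_T is terminal
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future residuals
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lambda = 1 the result equals rewards-to-go minus the value estimate.
adv = gae_advantages([1.0, 1.0], values=[0.5, 0.5], last_value=0.0,
                     gamma=1.0, gae_lambda=1.0)
```

Setting gae_lambda to 0 reduces the estimate to the one-step TD residual, while values near 1 approach the Monte Carlo return minus the baseline.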

metadata() → Dict[str, Any]

Optionally save metadata alongside the policy. This is a no-op in the base class but can be overridden by subclasses.