policy_gradient

Policy Gradient algorithm

Example Usage:

This example demonstrates how to initialize a Policy Gradient agent with a custom policy.

Classes

PolicyGradientAgent

Policy Gradient agent with step-wise optimization.

PolicyGradientConfig

Hyperparameter Configuration for the Policy Gradient agent.

PolicyGradientPolicy

Base class for Policy Gradient policies.

class prt_rl.model_free.policy_gradient.PolicyGradientAgent(policy: PolicyGradientPolicy, config: PolicyGradientConfig = PolicyGradientConfig(batch_size=100, learning_rate=0.001, gamma=0.99, gae_lambda=0.95, optim_steps=1, use_reward_to_go=False, use_gae=False, baseline_learning_rate=0.005, baseline_optim_steps=5, normalize_advantages=True), *, device: str = 'cpu')

Policy Gradient agent with step-wise optimization.

Example

from prt_rl.model_free.policy_gradient import PolicyGradientAgent, PolicyGradientConfig
from prt_rl.common.policies import DistributionPolicy

# Set up the environment
# env = ...

# Number of training iterations (illustrative value)
num_iterations = 100

# Configure the algorithm hyperparameters
config = PolicyGradientConfig(
    batch_size=1000,
    learning_rate=5e-3,
    gamma=1.0,
    use_reward_to_go=True,
    normalize_advantages=True,
)

# Configure the Policy Gradient policy
policy = DistributionPolicy(env_params=env.get_parameters())

# Create the agent
agent = PolicyGradientAgent(policy=policy, config=config)

# Train the agent
agent.train(env=env, total_steps=num_iterations * config.batch_size)

Parameters:

config (PolicyGradientConfig) – Configuration for the Policy Gradient agent.
policy (PolicyGradientPolicy) – The policy to be used by the agent.
device (str) – Device to run the agent on (e.g., ‘cpu’ or ‘cuda’). Default is ‘cpu’.

act(obs: Tensor, deterministic: bool = False) → Tensor

Perform an action based on the current state.

Parameters:

obs (torch.Tensor) – The current observation from the environment.
deterministic (bool) – If True, the agent will select actions deterministically.

Returns:

The action to be taken.

Return type:

torch.Tensor
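The difference between the two modes of the `deterministic` flag can be illustrated with a small standalone sketch. It assumes a categorical (discrete) action distribution and treats "deterministic" as "take the most probable action"; the actual behavior in prt_rl depends on the policy's distribution head, and `select_action` is a hypothetical helper, not part of the library.

```python
import math
import random


def select_action(logits, deterministic=False, rng=random):
    """Pick an action index from unnormalized log-probabilities (hypothetical helper)."""
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    if deterministic:
        # Assumed deterministic mode: always take the most probable action.
        return max(range(len(probs)), key=probs.__getitem__)

    # Stochastic mode: sample an action in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [0.1, 2.0, -1.0]
greedy = select_action(logits, deterministic=True)
sampled = select_action(logits, deterministic=False, rng=random.Random(0))
```

Sampling is usually what you want while training, since it provides exploration; the deterministic mode is more common for evaluation.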

classmethod load(path: str | Path, map_location: str | device = 'cpu') → PolicyGradientAgent

Loads the checkpoint and returns a fully constructed PolicyGradientAgent.

train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] = [], logger: Logger | None = None, evaluator: Evaluator | None = None, show_progress: bool = True) → None

Train the Policy Gradient agent using the provided environment.

Parameters:

env (EnvironmentInterface) – The environment in which the agent will operate.
total_steps (int) – Total number of training steps to perform.
schedulers (List[ParameterScheduler]) – List of parameter schedulers to update during training.
logger (Optional[Logger]) – Logger for logging training progress. If None, a default logger will be created.
evaluator (Optional[Evaluator]) – Evaluator used to assess the agent periodically during training. Default is None.
show_progress (bool) – If True, show a progress bar during training.

class prt_rl.model_free.policy_gradient.PolicyGradientConfig(batch_size: int = 100, learning_rate: float = 0.001, gamma: float = 0.99, gae_lambda: float = 0.95, optim_steps: int = 1, use_reward_to_go: bool = False, use_gae: bool = False, baseline_learning_rate: float = 0.005, baseline_optim_steps: int = 5, normalize_advantages: bool = True)

Hyperparameter Configuration for the Policy Gradient agent.

Parameters:

batch_size (int) – Size of the batch for training. Default is 100.
learning_rate (float) – Learning rate for the optimizer. Default is 1e-3.
gamma (float) – Discount factor for future rewards. Default is 0.99.
gae_lambda (float) – Lambda parameter for Generalized Advantage Estimation. Default is 0.95.
optim_steps (int) – Number of optimization steps per training iteration. Default is 1.
use_reward_to_go (bool) – Whether to use rewards-to-go instead of the total discounted return. Default is False.
use_gae (bool) – Whether to use Generalized Advantage Estimation. Default is False.
baseline_learning_rate (float) – Learning rate for the baseline network if used. Default is 5e-3.
baseline_optim_steps (int) – Number of optimization steps for the baseline network. Default is 5.
normalize_advantages (bool) – Whether to normalize advantages before training. Default is True.
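To make the `gamma`, `use_reward_to_go`, and `normalize_advantages` options concrete, here is a minimal pure-Python sketch of the quantities they refer to, computed for a single trajectory. The function names are illustrative and not part of prt_rl, and the library's own implementation may differ (for example, in how it handles batches of several episodes).

```python
def discounted_return(rewards, gamma):
    """Total discounted return of one trajectory: sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def rewards_to_go(rewards, gamma):
    """Discounted sum of rewards from each time step to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the sum already computed for the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    return [(v - mean) / (std + eps) for v in values]


rewards = [1.0, 1.0, 1.0]
gamma = 0.5
total = discounted_return(rewards, gamma)  # one number for the whole trajectory
rtg = rewards_to_go(rewards, gamma)        # one number per time step
adv = normalize(rtg)
```

With the total return, every action in the trajectory is weighted by the same number; with rewards-to-go, each action is weighted only by the rewards that follow it, which typically lowers the variance of the gradient estimate.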

class prt_rl.model_free.policy_gradient.PolicyGradientPolicy(network: Module, actor_head: DistributionHead, critic_head: ValueHead | None = None, *, critic_network: Module | None = None)

Base class for Policy Gradient policies. This class can be extended to create custom policies for the Policy Gradient agent. The policy should output a distribution over actions given the current state.

Parameters:

network (nn.Module) – The neural network that processes the input state and outputs a latent representation.
actor_head (DistributionHead) – The head that takes the latent representation from the network and outputs a distribution over actions.
critic_head (Optional[ValueHead]) – The head that takes the latent representation from the network and outputs a value estimate for the state. Default is None.
device (str) – Device to run the policy on (e.g., ‘cpu’ or ‘cuda’). Default is ‘cpu’.

act(obs: Tensor, deterministic: bool = False) → Tuple[Tensor, Dict[str, Tensor]]

Returns the action and an info dict.

Info dict keys (typical):

“log_prob”: (B,1)
“value”: (B,1)
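As a rough illustration of how a "log_prob" entry is typically consumed, the sketch below computes the standard REINFORCE-style surrogate loss, the negative mean of log-probability times advantage. It uses nested Python lists to mirror the (B,1) shape; `policy_gradient_loss` is a hypothetical helper, and this is not necessarily how prt_rl computes its loss.

```python
def policy_gradient_loss(log_probs, advantages):
    """Surrogate loss -mean(log_prob * advantage) over a batch.

    Both arguments are lists of B one-element rows, mirroring the (B,1) shape.
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same batch size")
    terms = [lp[0] * adv[0] for lp, adv in zip(log_probs, advantages)]
    return -sum(terms) / len(terms)


log_probs = [[-0.5], [-1.0], [-2.0]]   # log pi(a_t | s_t) for each sample
advantages = [[1.0], [0.0], [-1.0]]    # advantage estimate for each sample
loss = policy_gradient_loss(log_probs, advantages)
```

Minimizing this loss raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.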

get_state_value(obs: Tensor) → Tensor

Returns the state value for the given observation.

Parameters:

obs – Observation tensor of shape (B, obs_dim).

Returns:

State value tensor of shape (B,1).
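State-value estimates like these are the input to Generalized Advantage Estimation (the `use_gae` and `gae_lambda` options of PolicyGradientConfig). The following is a minimal sketch of the standard GAE recursion for a single trajectory, written in plain Python with hypothetical names; it ignores episode boundaries inside a batch and is not taken from the prt_rl source.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value is the value estimate of the state after the final step
    (0.0 if the trajectory ended in a terminal state).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
td_only = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, gae_lambda=0.0)
monte_carlo = gae_advantages(rewards, [0.0, 0.0, 0.0], last_value=0.0, gamma=1.0, gae_lambda=1.0)
```

Setting `gae_lambda` to 0 reduces the estimate to the one-step TD error, while setting it to 1 recovers the Monte Carlo return minus the value baseline; values in between trade bias against variance.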

metadata() → Dict[str, Any]

Optionally save metadata alongside the policy. This is a no-op in the base class but can be overridden by subclasses.
