policy_gradient
Policy Gradient algorithm
Example Usage:
This example demonstrates how to initialize a Policy Gradient agent with a custom policy.
Classes
PolicyGradientAgent – Policy Gradient agent with step-wise optimization.
PolicyGradientConfig – Hyperparameter Configuration for the Policy Gradient agent.
PolicyGradientPolicy – Base class for Policy Gradient policies.
- class prt_rl.model_free.policy_gradient.PolicyGradientAgent(policy: PolicyGradientPolicy, config: PolicyGradientConfig = PolicyGradientConfig(batch_size=100, learning_rate=0.001, gamma=0.99, gae_lambda=0.95, optim_steps=1, use_reward_to_go=False, use_gae=False, baseline_learning_rate=0.005, baseline_optim_steps=5, normalize_advantages=True), *, device: str = 'cpu')
Policy Gradient agent with step-wise optimization.
Example
from prt_rl.model_free.policy_gradient import PolicyGradientAgent, PolicyGradientConfig
from prt_rl.common.policies import DistributionPolicy

# Setup the environment
# env = ...

# Configure the algorithm hyperparameters
config = PolicyGradientConfig(
    batch_size=1000,
    learning_rate=5e-3,
    gamma=1.0,
    use_reward_to_go=True,
    normalize_advantages=True,
)

# Configure the Policy Gradient policy
policy = DistributionPolicy(env_params=env.get_parameters())

# Create the agent
agent = PolicyGradientAgent(policy=policy, config=config)

# Train the agent (num_iterations is the number of batches to train on, set by the caller)
agent.train(env=env, total_steps=num_iterations * config.batch_size)
- Parameters:
config (PolicyGradientConfig) – Configuration for the Policy Gradient agent.
policy (PolicyGradientPolicy) – The policy to be used by the agent.
device (str) – Device to run the agent on (e.g., ‘cpu’ or ‘cuda’). Default is ‘cpu’.
- act(obs: Tensor, deterministic: bool = False) → Tensor
Perform an action based on the current state.
- Parameters:
obs (torch.Tensor) – The current observation from the environment.
deterministic (bool) – If True, the agent will select actions deterministically.
- Returns:
The action to be taken.
- Return type:
torch.Tensor
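The effect of the deterministic flag can be pictured with a small categorical example. This is a minimal sketch in plain Python, not the library's implementation: `select_action` and the probability list are hypothetical, and it assumes that deterministic selection means taking the most probable action.

```python
import random


def select_action(probs, deterministic=False, rng=None):
    """Pick an action index from a categorical distribution (hypothetical helper)."""
    if deterministic:
        # Deterministic: always return the most probable action (the mode).
        return max(range(len(probs)), key=lambda i: probs[i])
    # Stochastic: sample an action index in proportion to its probability.
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.7, 0.2]
greedy = select_action(probs, deterministic=True)      # always index 1
sampled = select_action(probs, rng=random.Random(0))   # one of 0, 1, 2
```

Stochastic selection is usually what drives exploration during training, while deterministic selection is a common choice for evaluation.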
- classmethod load(path: str | Path, map_location: str | device = 'cpu') → PolicyGradientAgent
Loads the checkpoint and returns a fully-constructed PolicyGradientAgent.
- train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] = [], logger: Logger | None = None, evaluator: Evaluator | None = None, show_progress: bool = True) → None
Train the Policy Gradient agent using the provided environment.
- Parameters:
env (EnvironmentInterface) – The environment in which the agent will operate.
total_steps (int) – Total number of training steps to perform.
schedulers (List[ParameterScheduler]) – List of parameter schedulers to update during training.
logger (Optional[Logger]) – Logger for logging training progress. If None, a default logger will be created.
evaluator (Optional[Evaluator]) – Evaluator used to evaluate the agent periodically.
show_progress (bool) – If True, show a progress bar during training.
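The class example passes `total_steps` as `num_iterations * config.batch_size`, which suggests that the step budget is consumed one batch at a time. The sketch below only illustrates that arithmetic. It assumes each policy update uses exactly `batch_size` environment steps, which the source does not confirm, and `planned_updates` is a hypothetical helper, not part of the library.

```python
def planned_updates(total_steps: int, batch_size: int) -> int:
    """Number of full batches that fit into the step budget (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return total_steps // batch_size


batch_size = 1000
num_iterations = 50
total_steps = num_iterations * batch_size           # step budget passed to train()
updates = planned_updates(total_steps, batch_size)  # 50
```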
- class prt_rl.model_free.policy_gradient.PolicyGradientConfig(batch_size: int = 100, learning_rate: float = 0.001, gamma: float = 0.99, gae_lambda: float = 0.95, optim_steps: int = 1, use_reward_to_go: bool = False, use_gae: bool = False, baseline_learning_rate: float = 0.005, baseline_optim_steps: int = 5, normalize_advantages: bool = True)
Hyperparameter Configuration for the Policy Gradient agent.
- Parameters:
batch_size (int) – Size of the batch for training. Default is 100.
learning_rate (float) – Learning rate for the optimizer. Default is 1e-3.
gamma (float) – Discount factor for future rewards. Default is 0.99.
gae_lambda (float) – Lambda parameter for Generalized Advantage Estimation. Default is 0.95.
optim_steps (int) – Number of optimization steps per training iteration. Default is 1.
use_reward_to_go (bool) – Whether to use rewards-to-go instead of the total discounted return. Default is False.
use_gae (bool) – Whether to use Generalized Advantage Estimation. Default is False.
baseline_learning_rate (float) – Learning rate for the baseline network if used. Default is 5e-3.
baseline_optim_steps (int) – Number of optimization steps for the baseline network. Default is 5.
normalize_advantages (bool) – Whether to normalize advantages before training. Default is True.
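To make the roles of `gamma`, `gae_lambda`, `use_reward_to_go`, `use_gae`, and `normalize_advantages` concrete, the following plain-Python sketch computes each quantity for a single trajectory. It follows the textbook definitions and is not taken from the library. All function names are hypothetical, the value estimates are made-up inputs standing in for a baseline network, and the trajectory is assumed to end without truncation.

```python
import math


def discounted_return(rewards, gamma):
    """Total discounted return of one trajectory: sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def rewards_to_go(rewards, gamma):
    """Discounted return from each step onward: G_t = r_t + gamma * G_{t+1}."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def gae(rewards, values, last_value, gamma, lam):
    """Generalized Advantage Estimation for one trajectory."""
    advantages, running = [0.0] * len(rewards), 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of residuals with weight gamma * lam
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def normalize(xs, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / (std + eps) for x in xs]


rewards = [1.0, 1.0, 1.0]
total = discounted_return(rewards, gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
rtg = rewards_to_go(rewards, gamma=0.5)         # [1.75, 1.5, 1.0]
adv = gae(rewards, values=[0.0, 0.0, 0.0], last_value=0.0, gamma=0.5, lam=1.0)
normed = normalize(rtg)
```

With the total discounted return, every step of the trajectory is weighted by the same scalar, whereas rewards-to-go weight step t only by the rewards that follow it. With zero value estimates and a lambda of 1, the GAE advantages reduce to the rewards-to-go, which is why `adv` equals `rtg` here.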
- class prt_rl.model_free.policy_gradient.PolicyGradientPolicy(network: Module, actor_head: DistributionHead, critic_head: ValueHead | None = None, *, critic_network: Module | None = None)
Base class for Policy Gradient policies. This class can be extended to create custom policies for the Policy Gradient agent. The policy should output a distribution over actions given the current state.
- Parameters:
network (nn.Module) – The neural network that processes the input state and outputs a latent representation.
actor_head (DistributionHead) – The head that takes the latent representation from the network and outputs a distribution over actions.
critic_head (ValueHead, optional) – The head that takes the latent representation from the network and outputs a value estimate for the state. Default is None.
critic_network (nn.Module, optional) – A separate network for the critic. Default is None.
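- act(obs: Tensor, deterministic: bool = False) → Tuple[Tensor, Dict[str, Tensor]]
Returns the selected action together with an info dictionary.
- Info dict keys (typical):
“log_prob”: log-probability of the selected action, shape (B, 1)
“value”: value estimate for the observation, shape (B, 1)
Here B is the batch size.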
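The shape of this return value can be illustrated without any framework. The sketch below is a stand-in, not the library's `act`: it takes precomputed logits and value estimates instead of an observation tensor, uses nested lists in place of tensors, and returns actions as a flat list of length B, which is an assumption because the source only states the shapes of the info entries.

```python
import math
import random


def act(logits_batch, state_values, deterministic=False, rng=None):
    """Return (actions, info) for a batch of B observations (hypothetical stand-in).

    logits_batch: B lists of unnormalized action scores.
    state_values: B scalar value estimates, standing in for a critic head.
    """
    rng = rng or random.Random()
    actions, log_probs = [], []
    for logits in logits_batch:
        # Softmax, subtracting the max for numerical stability.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        if deterministic:
            a = max(range(len(probs)), key=lambda i: probs[i])
        else:
            a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        actions.append(a)
        log_probs.append([math.log(probs[a])])  # one row per sample: (B, 1)
    info = {
        "log_prob": log_probs,                  # shape (B, 1)
        "value": [[v] for v in state_values],   # shape (B, 1)
    }
    return actions, info


actions, info = act([[0.0, 2.0], [1.0, 0.0]], [0.5, -0.2], deterministic=True)
```

The log-probabilities collected here are the quantities a policy gradient loss weights by the return or advantage of each step.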