td3

Twin Delayed Deep Deterministic Policy Gradient (TD3)

Classes

TD3Agent – Twin Delayed Deep Deterministic Policy Gradient (TD3)
TD3Config – Configuration for the TD3 agent.
TD3Policy – TD3 Policy

class prt_rl.model_free.td3.TD3Agent(policy: TD3Policy, config: TD3Config = TD3Config(buffer_size=100000, min_buffer_size=1000, steps_per_batch=1, mini_batch_size=256, gradient_steps=1, learning_rate=0.001, gamma=0.99, exploration_noise=0.1, policy_noise=0.2, noise_clip=0.5, delay_freq=2, tau=0.005), *, device: str = 'cpu')

Twin Delayed Deep Deterministic Policy Gradient (TD3)

This class implements the TD3 algorithm, which is an off-policy actor-critic algorithm for continuous action spaces.

Parameters:

policy (TD3Policy | None) – Custom TD3 policy. If None, a default TD3 policy will be created.
config (TD3Config) – Configuration for the TD3 agent.
device (str) – Device to run the agent on (‘cpu’ or ‘cuda’). Default is ‘cpu’.

act(obs: Tensor, deterministic: bool = False) → Tensor

Select an action for the given observation.

Parameters:

obs (torch.Tensor) – The current observation of the environment.
deterministic (bool) – If True, the agent will select actions deterministically.

Returns:

The action to be taken.

Return type:

torch.Tensor
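The difference between deterministic and exploratory action selection can be sketched as follows. This is a minimal illustration, not the library's implementation: `select_action` is a hypothetical helper, plain Python lists stand in for tensors so the snippet runs without PyTorch, and it assumes that exploration adds zero-mean Gaussian noise with standard deviation `exploration_noise` and then clips to the action bounds.

```python
import random


def select_action(mean_action, action_min, action_max,
                  exploration_noise=0.1, deterministic=False, rng=random):
    """Hypothetical helper: optionally add Gaussian noise, then clip to bounds."""
    action = []
    for a, lo, hi in zip(mean_action, action_min, action_max):
        if not deterministic:
            # Assumed behavior: zero-mean Gaussian exploration noise.
            a += rng.gauss(0.0, exploration_noise)
        # Keep every action component inside [lo, hi].
        action.append(min(max(a, lo), hi))
    return action


rng = random.Random(0)  # seeded so the sketch is reproducible
bounds_lo, bounds_hi = [-1.0, -1.0], [1.0, 1.0]
greedy = select_action([0.95, -0.2], bounds_lo, bounds_hi, deterministic=True)
noisy = select_action([0.95, -0.2], bounds_lo, bounds_hi, rng=rng)
```

With `deterministic=True` the actor output is returned unchanged (up to clipping), which is the usual choice for evaluation.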

classmethod load(path: str | Path, map_location: str | device = 'cpu') → TD3Agent

Load a checkpoint and return a fully constructed TD3Agent.

train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] | None = None, logger: Logger | None = None, evaluator: Evaluator | None = None, show_progress: bool = True) → None

Train the agent by interacting with the environment for total_steps steps. This method implements the TD3 training loop.

Parameters:

env – The environment to interact with.
total_steps – Total number of steps to train the agent.
schedulers – Optional list of parameter schedulers.
logger – Optional logger for logging training progress.
evaluator – Optional evaluator for measuring the agent’s performance.
show_progress – If True, show a progress bar during training.
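How the configuration fields might interact inside the training loop can be sketched by counting updates. The function below is hypothetical and encodes one assumed reading of the config: transitions are collected `steps_per_batch` at a time, no updates happen until `min_buffer_size` transitions are stored, each iteration then runs `gradient_steps` critic updates, and the actor (with the target networks) is updated on every `delay_freq`-th critic update. The actual loop in `train` may differ.

```python
def td3_update_schedule(total_steps, min_buffer_size=1000, steps_per_batch=1,
                        gradient_steps=1, delay_freq=2):
    """Count critic and actor updates under the assumed schedule."""
    collected = critic_updates = actor_updates = 0
    while collected < total_steps:
        collected += steps_per_batch      # collect a batch of transitions
        if collected < min_buffer_size:   # warm-up: only fill the buffer
            continue
        for _ in range(gradient_steps):
            critic_updates += 1           # critics update on every gradient step
            if critic_updates % delay_freq == 0:
                actor_updates += 1        # delayed actor and target update
    return critic_updates, actor_updates


critic_updates, actor_updates = td3_update_schedule(total_steps=1200)
# With the defaults shown: 201 critic updates and 100 actor updates.
```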

class prt_rl.model_free.td3.TD3Config(buffer_size: int = 100000, min_buffer_size: int = 1000, steps_per_batch: int = 1, mini_batch_size: int = 256, gradient_steps: int = 1, learning_rate: float = 0.001, gamma: float = 0.99, exploration_noise: float = 0.1, policy_noise: float = 0.2, noise_clip: float = 0.5, delay_freq: int = 2, tau: float = 0.005)

Configuration for the TD3 agent.

Parameters:

buffer_size (int) – Size of the replay buffer.
min_buffer_size (int) – Minimum size of the replay buffer before training starts.
steps_per_batch (int) – Number of steps to collect per batch.
mini_batch_size (int) – Size of the mini-batch sample for each gradient update.
gradient_steps (int) – Number of gradient steps to take per training iteration.
learning_rate (float) – Learning rate for the optimizer.
gamma (float) – Discount factor for future rewards.
exploration_noise (float) – Standard deviation of Gaussian noise added to actions for exploration.
policy_noise (float) – Standard deviation of noise added to the target policy’s actions.
noise_clip (float) – Maximum absolute value of noise added to the target policy’s actions.
delay_freq (int) – Frequency of delayed policy updates.
tau (float) – Polyak averaging factor for target networks.
num_critics (int) – Number of critic networks to use. This parameter does not appear in the TD3Config signature above; it is an argument of TD3Policy.
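The role of `tau` can be illustrated with a Polyak (soft) target update. This is a sketch under an assumed convention, namely that `tau` weights the online parameters and `1 - tau` the existing target parameters; `polyak_update` is a hypothetical helper operating on plain lists of floats rather than network parameters.

```python
def polyak_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * p
            for t, p in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = polyak_update(target, online)  # first entry moves from 0.0 to 0.005
```

A small `tau` such as the default 0.005 makes the target networks change slowly, which stabilizes the bootstrapped targets.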

class prt_rl.model_free.td3.TD3Policy(network: Module, actor_head: ContinuousHead, critic_head: QValueHead, *, action_min: Tensor, action_max: Tensor, num_critics: int = 2, exploration_noise: float = 0.1, critic_network: Module | None = None)

TD3 Policy

This class implements the TD3 policy, which consists of an actor network and multiple critic networks. The actor network is used to select actions, while the critic networks are used to evaluate the actions. The policy can share the encoder with the actor and critic networks if specified.

Parameters:

env_params (EnvParams) – Environment parameters.
num_critics (int) – Number of critic networks to use. Default is 2.
actor (Optional[ContinuousPolicy]) – Custom actor network. If None, a default actor will be created.
critic (Optional[StateActionCritic]) – Custom critic network. If None, a default critic will be created.
share_encoder (bool) – Whether to share the encoder between actor and critic networks. Default is True.
device (str) – Device to run the policy on (‘cpu’ or ‘cuda’). Default is ‘cpu’.

act(obs: Tensor, deterministic: bool = False) → Tuple[Tensor, Dict[str, Tensor]]

Return an action tensor and auxiliary policy outputs.

get_q_values(obs: Tensor, action: Tensor, index: int | None = None) → Tensor

Get Q-values from all critics for the given state-action pairs.

Parameters:

obs (torch.Tensor) – Current observation tensor.
action (torch.Tensor) – Action tensor.

Returns:

Tensor containing Q-values from all critics. Shape (B, C, 1) where C is the number of critics.

Return type:

torch.Tensor
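One way a (B, C, 1) tensor of Q-values could be consumed is a regression loss that pulls every critic toward a shared target. The sketch below assumes a mean-squared-error loss averaged over the batch and over critics; `critic_loss` is a hypothetical helper, and nested Python lists stand in for tensors.

```python
def critic_loss(q_values, targets):
    """Mean squared error of every critic against a shared per-sample target.

    q_values: nested list shaped (B, C, 1); targets: nested list shaped (B, 1).
    """
    total, count = 0.0, 0
    for per_sample, (y,) in zip(q_values, targets):
        for (q,) in per_sample:   # iterate over the C critics
            total += (q - y) ** 2
            count += 1
    return total / count


q = [[[1.5], [1.0]], [[0.0], [0.5]]]  # B=2 samples, C=2 critics
y = [[1.0], [0.5]]
loss = critic_loss(q, y)  # (0.25 + 0.0 + 0.25 + 0.0) / 4 = 0.125
```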

get_target_q_values(obs: Tensor, action: Tensor) → Tensor

Get target Q-values from all target critics for the given state-action pairs.

Parameters:

obs (torch.Tensor) – Current observation tensor.
action (torch.Tensor) – Action tensor.

Returns:

Tensor containing target Q-values from all critics. Shape (B, C, 1) where C is the number of critics.

Return type:

torch.Tensor
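Target Q-values shaped (B, C, 1) are typically reduced with a minimum over the critic axis to form the TD3 learning target (clipped double Q-learning). The sketch below assumes the standard target r + gamma * (1 - done) * min over critics; `td3_target` is a hypothetical helper on nested lists, and how the library combines these values is not stated in this page.

```python
def td3_target(rewards, dones, target_q, gamma=0.99):
    """Bootstrapped target from the minimum over target critics.

    rewards, dones: lists of length B; target_q: nested list shaped (B, C, 1).
    Returns a nested list shaped (B, 1).
    """
    targets = []
    for r, d, per_sample in zip(rewards, dones, target_q):
        q_min = min(q[0] for q in per_sample)  # pessimistic estimate over critics
        # Terminal transitions (d == 1.0) do not bootstrap.
        targets.append([r + gamma * (1.0 - d) * q_min])
    return targets


target_q = [[[2.0], [1.0]], [[5.0], [4.0]]]  # B=2 samples, C=2 critics
y = td3_target(rewards=[1.0, 0.0], dones=[0.0, 1.0], target_q=target_q)
# y[0] is about 1.99 (1.0 + 0.99 * 1.0); y[1] is 0.0 because the episode ended
```

Taking the minimum rather than the mean counteracts the overestimation bias that a single critic tends to accumulate.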

metadata()

Optionally save metadata alongside the policy. This is a no-op in the base class but can be overridden by subclasses.

target_actor_action(obs: Tensor, policy_noise: float, noise_clip: float, action_shape) → Tensor

Compute the target actor’s action with added noise for policy smoothing.

Parameters:

obs (torch.Tensor) – The current observation of the environment.
policy_noise (float) – Standard deviation of noise added to the target policy’s actions.
noise_clip (float) – Maximum absolute value of noise added to the target policy’s actions.
action_shape – Shape of the action tensor.

Returns:

The action computed by the target actor with added noise, clipped to action bounds.

Return type:

torch.Tensor
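Target policy smoothing as described above can be sketched in three steps: sample Gaussian noise with standard deviation `policy_noise`, clip it to plus or minus `noise_clip`, then clip the perturbed action to the action bounds. The helper `smoothed_target_action` is hypothetical and takes the target actor's output as a plain list; it is an illustration of the technique, not the library's code.

```python
import random


def smoothed_target_action(target_mean, action_min, action_max,
                           policy_noise=0.2, noise_clip=0.5, rng=random):
    """Hypothetical stand-in for target policy smoothing on plain lists."""
    smoothed = []
    for a, lo, hi in zip(target_mean, action_min, action_max):
        noise = rng.gauss(0.0, policy_noise)
        # Bound the perturbation so the target stays near the actor's output.
        noise = max(-noise_clip, min(noise, noise_clip))
        # Keep the perturbed action inside the valid action range.
        smoothed.append(max(lo, min(a + noise, hi)))
    return smoothed


rng = random.Random(0)  # seeded so the sketch is reproducible
action = smoothed_target_action([0.9, 0.0], [-1.0, -1.0], [1.0, 1.0], rng=rng)
```

Smoothing the target action averages the critic's estimate over a small neighborhood of actions, which makes the target less sensitive to sharp peaks in the learned Q-function.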
