dqn

Deep Q-Network (DQN) Agents

Classes

DQNAgent

Deep Q-Network (DQN) agent for reinforcement learning.

DQNConfig

Hyperparameters for the DQN agent.

DQNPolicy

DQN Policy that outputs Q-values for each action given a state.

DoubleDQNAgent

Double DQN agent for reinforcement learning.

class prt_rl.model_free.dqn.DQNAgent(policy: DQNPolicy, config: DQNConfig = DQNConfig(buffer_size=1000000, min_buffer_size=10000, mini_batch_size=32, learning_rate=0.1, gamma=0.99, max_grad_norm=None, target_update_freq=1, polyak_tau=None, train_freq=1, gradient_steps=1), *, device: str = 'cpu')

Deep Q-Network (DQN) agent for reinforcement learning.

Parameters:
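  • policy (DQNPolicy) – DQN policy that outputs Q-values for each action given a state.

  • config (DQNConfig, optional) – Hyperparameters for the agent. The learning rate, discount factor, buffer sizes, gradient clipping, and target-update settings are fields of DQNConfig. Defaults to DQNConfig().

  • learning_rate (float, optional) – Learning rate (the learning_rate field of DQNConfig). Defaults to 0.1.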

  • gamma (float, optional) – Discount factor. Defaults to 0.99.

  • buffer_size (int, optional) – Size of the replay buffer. Defaults to 1_000_000.

  • min_buffer_size (int, optional) – Minimum size of the replay buffer before training. Defaults to 10_000.

  • mini_batch_size (int, optional) – Size of the mini-batch for training. Defaults to 32.

  • max_grad_norm (float, optional) – Maximum gradient norm for clipping. Defaults to None.

  • target_update_freq (int, optional) – Frequency of target network updates. Defaults to 1.

  • polyak_tau (float, optional) – Polyak averaging coefficient for target network updates. Defaults to None.

  • decision_function (EpsilonGreedy, optional) – Decision function for action selection. Defaults to EpsilonGreedy(epsilon=0.1).

  • replay_buffer (BaseReplayBuffer, optional) – Replay buffer for storing experiences. Defaults to None.

  • device (str, optional) – Device for computation (‘cpu’ or ‘cuda’). Defaults to ‘cpu’.

References:

[1] https://openai.com/index/openai-baselines-dqn/

[2] openai/baselines

[3] Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
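The one-step target that a DQN agent regresses its Q-values toward can be sketched in plain Python. This is only an illustration of the standard rule from Mnih et al. (2015), not the library's implementation; the function name dqn_targets and its list-based arguments are hypothetical, and the real agent operates on torch tensors.

```python
from typing import List, Sequence


def dqn_targets(
    rewards: Sequence[float],
    dones: Sequence[bool],
    next_target_q: Sequence[Sequence[float]],
    gamma: float = 0.99,
) -> List[float]:
    """Compute r + gamma * max_a Q_target(s', a) for each transition.

    Hypothetical helper: next_target_q[i] holds the target network's
    Q-values for every action in the i-th next state.
    """
    targets = []
    for reward, done, q_next in zip(rewards, dones, next_target_q):
        # Terminal transitions do not bootstrap from the next state.
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


batch_targets = dqn_targets(
    rewards=[1.0, 0.0],
    dones=[False, True],
    next_target_q=[[0.5, 2.0], [3.0, 1.0]],
    gamma=0.99,
)
```

The first transition bootstraps from the largest target Q-value (2.0), while the second is terminal and keeps only its reward.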

act(obs: Tensor, deterministic: bool = False) → Tensor

Perform an action based on the current state.

Parameters:
  • obs (torch.Tensor) – The current observation of the environment.

  • deterministic (bool) – If True, the agent will select actions deterministically.

Returns:

The action to be taken.

Return type:

torch.Tensor
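The parameter list above mentions an EpsilonGreedy decision function with epsilon=0.1. A minimal sketch of epsilon-greedy selection follows, assuming that deterministic=True means "always take the greedy action"; the function epsilon_greedy and its rng argument are hypothetical and do not reflect the library's EpsilonGreedy class.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy(
    q_values: Sequence[float],
    epsilon: float = 0.1,
    deterministic: bool = False,
    rng: Optional[random.Random] = None,
) -> int:
    """Return an action index chosen by the epsilon-greedy rule."""
    if rng is None:
        rng = random.Random()
    # Explore with probability epsilon, unless a deterministic action is requested.
    if not deterministic and rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


greedy_action = epsilon_greedy([0.1, 0.9, 0.3], deterministic=True)
```

With deterministic=True the exploration branch is skipped, so the same observation always maps to the same action.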

classmethod load(path: str | Path, map_location: str | device = 'cpu') → DQNAgent

Loads the checkpoint and returns a fully-constructed DQNAgent.

train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] | None = None, logger: Logger | None = None, evaluator: Evaluator = <prt_rl.common.evaluators.Evaluator object>, show_progress: bool = True) → None

Train the DQN agent.

Parameters:
  • env (EnvironmentInterface) – The environment to train on.

  • total_steps (int) – Total number of steps to train the agent.

  • schedulers (List[ParameterScheduler], optional) – List of schedulers to update during training. Defaults to None.

  • logger (Logger, optional) – Logger to log training metrics. Defaults to None.

  • evaluator (Evaluator) – Evaluator to evaluate the agent periodically.

  • show_progress (bool) – If True, show a progress bar during training.
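How total_steps interacts with the min_buffer_size, train_freq, and gradient_steps hyperparameters can be sketched as a simple counting loop. This is an assumption about a typical off-policy training schedule, not a description of what train actually does: the helper count_gradient_updates is hypothetical, and it assumes one transition is stored per environment step.

```python
def count_gradient_updates(
    total_steps: int,
    min_buffer_size: int = 10_000,
    train_freq: int = 1,
    gradient_steps: int = 1,
) -> int:
    """Count gradient updates under an assumed DQN-style schedule."""
    updates = 0
    for step in range(1, total_steps + 1):
        # Assumption: the buffer gains exactly one transition per step.
        buffer_len = step
        if buffer_len < min_buffer_size:
            continue  # warm-up phase: collect data only
        if step % train_freq == 0:
            updates += gradient_steps
    return updates


warmup_only = count_gradient_updates(total_steps=10)
small_run = count_gradient_updates(
    total_steps=100, min_buffer_size=50, train_freq=4, gradient_steps=2
)
```

Under these assumptions, a run shorter than min_buffer_size performs no gradient updates at all, which is worth keeping in mind when choosing total_steps for quick experiments.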

class prt_rl.model_free.dqn.DQNConfig(buffer_size: int = 1000000, min_buffer_size: int = 10000, mini_batch_size: int = 32, learning_rate: float = 0.1, gamma: float = 0.99, max_grad_norm: float | None = None, target_update_freq: int = 1, polyak_tau: float | None = None, train_freq: int = 1, gradient_steps: int = 1)

Hyperparameters for the DQN agent.

Parameters:
  • buffer_size (int) – Size of the replay buffer. Default is 1_000_000.

  • min_buffer_size (int) – Minimum size of the replay buffer before training. Default is 10_000.

  • mini_batch_size (int) – Size of the mini-batch for training. Default is 32.

  • learning_rate (float) – Learning rate for the optimizer. Default is 0.1.

  • gamma (float) – Discount factor for future rewards. Default is 0.99.

  • max_grad_norm (float | None) – Maximum gradient norm for gradient clipping. Default is None.

  • target_update_freq (int) – Frequency of target network updates. Default is 1.

  • polyak_tau (float | None) – Polyak averaging coefficient for target network updates. Default is None.

  • train_freq (int) – Frequency of training steps. Default is 1.

  • gradient_steps (int) – Number of gradient steps per training iteration. Default is 1.
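The two target-network settings, target_update_freq and polyak_tau, correspond to the usual "hard copy" and "soft (Polyak) average" update styles. The sketch below illustrates both on plain lists of weights. It assumes that a non-None polyak_tau selects the soft update and that tau weights the online network; the library's actual precedence and tau convention are not stated in this reference, and update_target is a hypothetical helper.

```python
from typing import List, Optional


def update_target(
    online: List[float],
    target: List[float],
    step: int,
    target_update_freq: int = 1,
    polyak_tau: Optional[float] = None,
) -> List[float]:
    """Return new target weights using a soft or hard update (assumed rules)."""
    if polyak_tau is not None:
        # Soft update: target <- tau * online + (1 - tau) * target.
        return [polyak_tau * w + (1.0 - polyak_tau) * t
                for w, t in zip(online, target)]
    if step % target_update_freq == 0:
        # Hard update: copy the online weights every target_update_freq steps.
        return list(online)
    # Between hard updates the target network is left unchanged.
    return list(target)


soft = update_target([1.0, 2.0], [0.0, 0.0], step=1, polyak_tau=0.5)
hard = update_target([1.0, 2.0], [0.0, 0.0], step=4, target_update_freq=2)
skipped = update_target([1.0, 2.0], [0.0, 0.0], step=3, target_update_freq=2)
```

With the default target_update_freq=1 and polyak_tau=None, this rule copies the online weights on every step, so the target network never lags behind the online network.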

class prt_rl.model_free.dqn.DQNPolicy(network: Module, decision_function: DecisionFunction)

DQN Policy that outputs Q-values for each action given a state.

Parameters:
  • network (nn.Module) – Neural network that processes the input state and outputs a latent representation.

  • decision_function (DecisionFunction) – Decision function for action selection.


act(obs: Tensor, deterministic: bool = False) → Tuple[Tensor, Dict[str, Tensor]]

Return an action tensor and auxiliary policy outputs.

get_q_values(obs: Tensor) → Tensor

Return the Q-values for the given observation.

get_target_q_values(obs: Tensor) → Tensor

Return the Q-values from the target network for the given observation.

metadata() → Dict[str, Any]

Optionally save metadata alongside the policy. This is a no-op in the base class but can be overridden by subclasses.

class prt_rl.model_free.dqn.DoubleDQNAgent(policy: DQNPolicy, config: DQNConfig = DQNConfig(buffer_size=1000000, min_buffer_size=10000, mini_batch_size=32, learning_rate=0.1, gamma=0.99, max_grad_norm=None, target_update_freq=1, polyak_tau=None, train_freq=1, gradient_steps=1), *, device: str = 'cpu')

Double DQN agent for reinforcement learning.

Parameters:
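  • policy (DQNPolicy) – DQN policy that outputs Q-values for each action given a state.

  • config (DQNConfig, optional) – Hyperparameters for the agent. The learning rate, discount factor, buffer sizes, gradient clipping, and target-update settings are fields of DQNConfig. Defaults to DQNConfig().

  • learning_rate (float, optional) – Learning rate (the learning_rate field of DQNConfig). Defaults to 0.1.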

  • gamma (float, optional) – Discount factor. Defaults to 0.99.

  • buffer_size (int, optional) – Size of the replay buffer. Defaults to 1_000_000.

  • min_buffer_size (int, optional) – Minimum size of the replay buffer before training. Defaults to 10_000.

  • mini_batch_size (int, optional) – Size of the mini-batch for training. Defaults to 32.

  • max_grad_norm (float, optional) – Maximum gradient norm for clipping. Defaults to None.

  • target_update_freq (int, optional) – Frequency of target network updates. Defaults to 1.

  • polyak_tau (float, optional) – Polyak averaging coefficient for target network updates. Defaults to None.

  • decision_function (EpsilonGreedy, optional) – Decision function for action selection. Defaults to EpsilonGreedy(epsilon=0.1).

  • device (str, optional) – Device for computation (‘cpu’ or ‘cuda’). Defaults to ‘cpu’.

References: [1] Curt-Park/rainbow-is-all-you-need
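Double DQN differs from DQN only in how the bootstrap value is chosen: the online network selects the next action and the target network evaluates it. The sketch below shows that standard rule on plain Python lists. It is not taken from the library's source, and double_dqn_targets and its arguments are hypothetical names.

```python
from typing import List, Sequence


def double_dqn_targets(
    rewards: Sequence[float],
    dones: Sequence[bool],
    next_online_q: Sequence[Sequence[float]],
    next_target_q: Sequence[Sequence[float]],
    gamma: float = 0.99,
) -> List[float]:
    """Compute r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    targets = []
    for reward, done, q_online, q_target in zip(
        rewards, dones, next_online_q, next_target_q
    ):
        if done:
            # Terminal transitions do not bootstrap.
            targets.append(reward)
            continue
        # Action selection uses the online network...
        best_action = max(range(len(q_online)), key=lambda a: q_online[a])
        # ...while action evaluation uses the target network.
        targets.append(reward + gamma * q_target[best_action])
    return targets


ddqn_targets = double_dqn_targets(
    rewards=[1.0],
    dones=[False],
    next_online_q=[[0.2, 0.9]],
    next_target_q=[[5.0, 1.0]],
    gamma=0.99,
)
```

In this example the online network prefers action 1, so the target uses the target network's value for action 1 (1.0) rather than its maximum (5.0). Decoupling selection from evaluation in this way is what reduces the overestimation of a plain max.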

act(obs: Tensor, deterministic: bool = False) → Tensor

Perform an action based on the current state.

Parameters:
  • obs (torch.Tensor) – The current observation of the environment.

  • deterministic (bool) – If True, the agent will select actions deterministically.

Returns:

The action to be taken.

Return type:

torch.Tensor

classmethod load(path: str | Path, map_location: str | device = 'cpu') → DQNAgent

Loads the checkpoint and returns a fully-constructed DQNAgent.

train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] | None = None, logger: Logger | None = None, evaluator: Evaluator = <prt_rl.common.evaluators.Evaluator object>, show_progress: bool = True) → None

Train the DQN agent.

Parameters:
  • env (EnvironmentInterface) – The environment to train on.

  • total_steps (int) – Total number of steps to train the agent.

  • schedulers (List[ParameterScheduler], optional) – List of schedulers to update during training. Defaults to None.

  • logger (Logger, optional) – Logger to log training metrics. Defaults to None.

  • evaluator (Evaluator) – Evaluator to evaluate the agent periodically.

  • show_progress (bool) – If True, show a progress bar during training.