dqn
Deep Q-Network (DQN) Agents
Classes
DQNAgent – Deep Q-Network (DQN) agent for reinforcement learning.
DQNConfig – Hyperparameters for the DQN agent.
DQNPolicy – DQN Policy that outputs Q-values for each action given a state.
DoubleDQNAgent – Double DQN agent for reinforcement learning.
- class prt_rl.model_free.dqn.DQNAgent(policy: DQNPolicy, config: DQNConfig = DQNConfig(buffer_size=1000000, min_buffer_size=10000, mini_batch_size=32, learning_rate=0.1, gamma=0.99, max_grad_norm=None, target_update_freq=1, polyak_tau=None, train_freq=1, gradient_steps=1), *, device: str = 'cpu')
Deep Q-Network (DQN) agent for reinforcement learning.
- Parameters:
policy (DQNPolicy) – Policy that outputs Q-values for each action and performs action selection through its decision function.
config (DQNConfig, optional) – Hyperparameters for the agent: learning rate, discount factor, replay buffer sizes, mini-batch size, gradient clipping, and target network update settings. Defaults to DQNConfig() (see DQNConfig for the individual defaults).
device (str, optional) – Device for computation (‘cpu’ or ‘cuda’). Keyword-only. Defaults to ‘cpu’.
References:
[1] https://openai.com/index/openai-baselines-dqn/
[2] openai/baselines
[3] Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
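For orientation, the one-step target that DQN regresses toward (Mnih et al., 2015) can be sketched in plain Python. This is a minimal illustration and not the prt_rl implementation: the function name `dqn_targets` and the list-based inputs are hypothetical, and the library itself operates on tensors.

```python
def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """Compute y = r + gamma * max_a Q_target(s', a), with no bootstrap on terminal steps.

    Hypothetical helper for illustration only; not part of prt_rl.
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_target, dones):
        # Terminal transitions contribute only the immediate reward.
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


# Two transitions with two actions each; the second one is terminal.
rewards = [1.0, 0.0]
next_q_target = [[0.5, 2.0], [3.0, 1.0]]
dones = [False, True]
targets = dqn_targets(rewards, next_q_target, dones)
```

The `gamma` default mirrors the documented `DQNConfig.gamma` of 0.99.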
- act(obs: Tensor, deterministic: bool = False) -> Tensor
Perform an action based on the current state.
- Parameters:
obs (torch.Tensor) – The current observation of the environment.
deterministic (bool) – If True, the agent will select actions deterministically.
- Returns:
The action to be taken.
- Return type:
torch.Tensor
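The effect of the `deterministic` flag can be pictured with an epsilon-greedy rule over Q-values. The sketch below is an assumption about the intended behavior, based on the parameter description. The helper `epsilon_greedy` is hypothetical and works on plain lists, not tensors.

```python
import random


def epsilon_greedy(q_values, epsilon=0.1, deterministic=False, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    Hypothetical helper; assumes deterministic=True disables exploration.
    """
    if not deterministic and rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)


greedy_action = epsilon_greedy([0.1, 0.7, 0.2], deterministic=True)
```

With `deterministic=True` the result depends only on the Q-values, which is what makes it suitable for evaluation runs.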
- classmethod load(path: str | Path, map_location: str | device = 'cpu') -> DQNAgent
Loads the checkpoint and returns a fully-constructed DQNAgent.
- train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] | None = None, logger: Logger | None = None, evaluator: Evaluator = <prt_rl.common.evaluators.Evaluator object>, show_progress: bool = True) -> None
Train the DQN agent.
- Parameters:
env (EnvironmentInterface) – The environment to train on.
total_steps (int) – Total number of steps to train the agent.
schedulers (List[ParameterScheduler], optional) – List of schedulers to update during training. Defaults to None.
logger (Logger, optional) – Logger to log training metrics. Defaults to None.
evaluator (Evaluator) – Evaluator to evaluate the agent periodically.
show_progress (bool) – If True, show a progress bar during training. Defaults to True.
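How `total_steps` interacts with the `min_buffer_size`, `train_freq`, and `gradient_steps` settings of `DQNConfig` can be sketched as a simple counting loop. The exact scheduling inside prt_rl is not documented here, so the semantics below are assumptions: one transition stored per environment step, no updates until the buffer holds `min_buffer_size` transitions, then `gradient_steps` updates every `train_freq` steps. The function name is hypothetical.

```python
def count_gradient_updates(total_steps, min_buffer_size, train_freq, gradient_steps):
    """Count gradient updates under the assumed schedule (illustration only)."""
    buffer_len = 0
    updates = 0
    for step in range(1, total_steps + 1):
        buffer_len += 1  # one transition collected per environment step
        if buffer_len < min_buffer_size:
            continue  # warm-up phase: collect data, do not train yet
        if step % train_freq == 0:
            updates += gradient_steps
    return updates


updates = count_gradient_updates(
    total_steps=100, min_buffer_size=20, train_freq=4, gradient_steps=2
)
```

Under these assumptions, raising `min_buffer_size` delays the first update, while `train_freq` and `gradient_steps` together set the ratio of updates to environment steps.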
- class prt_rl.model_free.dqn.DQNConfig(buffer_size: int = 1000000, min_buffer_size: int = 10000, mini_batch_size: int = 32, learning_rate: float = 0.1, gamma: float = 0.99, max_grad_norm: float | None = None, target_update_freq: int = 1, polyak_tau: float | None = None, train_freq: int = 1, gradient_steps: int = 1)
Hyperparameters for the DQN agent.
- Parameters:
buffer_size (int) – Size of the replay buffer. Default is 1_000_000.
min_buffer_size (int) – Minimum size of the replay buffer before training. Default is 10_000.
mini_batch_size (int) – Size of the mini-batch for training. Default is 32.
learning_rate (float) – Learning rate for the optimizer. Default is 0.1.
gamma (float) – Discount factor for future rewards. Default is 0.99.
max_grad_norm (float | None) – Maximum gradient norm for gradient clipping. Default is None.
target_update_freq (int) – Frequency of target network updates. Default is 1.
polyak_tau (float | None) – Polyak averaging coefficient for target network updates. Default is None.
train_freq (int) – Frequency of training steps. Default is 1.
gradient_steps (int) – Number of gradient steps per training iteration. Default is 1.
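The two target-network settings, `target_update_freq` and `polyak_tau`, correspond to the two common update styles: a periodic hard copy and a Polyak (soft) average. The sketch below illustrates both on plain lists of weights. How prt_rl combines the two options, including whether a non-None `polyak_tau` takes precedence, is an assumption here, and `update_target` is a hypothetical helper.

```python
def update_target(online, target, step, target_update_freq=1, polyak_tau=None):
    """Return new target weights (illustration only, not the prt_rl code).

    Assumes polyak_tau, when set, selects soft updates on every call.
    """
    if polyak_tau is not None:
        # Soft update: target <- tau * online + (1 - tau) * target
        return [polyak_tau * o + (1.0 - polyak_tau) * t for o, t in zip(online, target)]
    if step % target_update_freq == 0:
        # Hard update: copy the online weights.
        return list(online)
    return list(target)


soft = update_target([1.0, 2.0], [0.0, 0.0], step=1, polyak_tau=0.1)
hard = update_target([1.0, 2.0], [0.0, 0.0], step=10, target_update_freq=10)
kept = update_target([1.0, 2.0], [0.0, 0.0], step=3, target_update_freq=10)
```

A small `polyak_tau` moves the target slowly toward the online weights, while a large `target_update_freq` keeps the target fixed for longer between copies.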
- class prt_rl.model_free.dqn.DQNPolicy(network: Module, decision_function: DecisionFunction)
DQN Policy that outputs Q-values for each action given a state.
- Parameters:
network (nn.Module) – Neural network that processes the input state and outputs a latent representation.
decision_function (DecisionFunction) – Decision function for action selection.
- act(obs: Tensor, deterministic: bool = False) -> Tuple[Tensor, Dict[str, Tensor]]
Return an action tensor and auxiliary policy outputs.
- class prt_rl.model_free.dqn.DoubleDQNAgent(policy: DQNPolicy, config: DQNConfig = DQNConfig(buffer_size=1000000, min_buffer_size=10000, mini_batch_size=32, learning_rate=0.1, gamma=0.99, max_grad_norm=None, target_update_freq=1, polyak_tau=None, train_freq=1, gradient_steps=1), *, device: str = 'cpu')
Double DQN agent for reinforcement learning.
- Parameters:
policy (DQNPolicy) – Policy that outputs Q-values for each action and performs action selection through its decision function.
config (DQNConfig, optional) – Hyperparameters for the agent: learning rate, discount factor, replay buffer sizes, mini-batch size, gradient clipping, and target network update settings. Defaults to DQNConfig() (see DQNConfig for the individual defaults).
device (str, optional) – Device for computation (‘cpu’ or ‘cuda’). Keyword-only. Defaults to ‘cpu’.
References: [1] Curt-Park/rainbow-is-all-you-need
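Double DQN differs from DQN in how the bootstrap value is chosen: the online network selects the next action and the target network evaluates it. The following is a minimal sketch of that target on plain lists. The function name and inputs are hypothetical and do not reflect prt_rl internals.

```python
def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Compute y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    Hypothetical helper for illustration only.
    """
    targets = []
    for reward, q_on, q_tg, done in zip(rewards, next_q_online, next_q_target, dones):
        if done:
            targets.append(reward)  # no bootstrap on terminal transitions
            continue
        # Action selection uses the online network's estimates ...
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        # ... while the value of that action comes from the target network.
        targets.append(reward + gamma * q_tg[best_action])
    return targets


targets = double_dqn_targets(
    rewards=[1.0],
    next_q_online=[[0.2, 0.9]],
    next_q_target=[[5.0, 1.0]],
    dones=[False],
)
```

In this example the online network prefers action 1, so the target uses the target network's value for action 1 rather than its own maximum for action 0. Decoupling selection from evaluation in this way is what reduces overestimation.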
- act(obs: Tensor, deterministic: bool = False) -> Tensor
Perform an action based on the current state.
- Parameters:
obs (torch.Tensor) – The current observation of the environment.
deterministic (bool) – If True, the agent will select actions deterministically.
- Returns:
The action to be taken.
- Return type:
torch.Tensor
- classmethod load(path: str | Path, map_location: str | device = 'cpu') -> DQNAgent
Loads the checkpoint and returns a fully-constructed DQNAgent.
- train(env: EnvironmentInterface, total_steps: int, schedulers: List[ParameterScheduler] | None = None, logger: Logger | None = None, evaluator: Evaluator = <prt_rl.common.evaluators.Evaluator object>, show_progress: bool = True) -> None
Train the DQN agent.
- Parameters:
env (EnvironmentInterface) – The environment to train on.
total_steps (int) – Total number of steps to train the agent.
schedulers (List[ParameterScheduler], optional) – List of schedulers to update during training. Defaults to None.
logger (Logger, optional) – Logger to log training metrics. Defaults to None.
evaluator (Evaluator) – Evaluator to evaluate the agent periodically.
show_progress (bool) – If True, show a progress bar during training. Defaults to True.