bandits
Classes
- class prt_sim.jhu.bandits.KArmBandits(num_bandits: int = 10)
K-arm Bandits simulation
The k-arm bandit testbed draws the true value $q_*(a)$ of each action from a normal distribution with mean zero and unit variance. The reward for taking action $a$ is then drawn from a normal distribution with mean $q_*(a)$ and unit variance. Sutton and Barto [1] track two metrics on this testbed: average reward versus steps and percentage of optimal actions versus steps.
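The setup described above can be written compactly as follows. The symbols $A_t$ (action at step $t$), $R_t$ (reward at step $t$), and $a^*$ (optimal action) do not appear in the original text; they are introduced here only for illustration, following common bandit notation:

$$q_*(a) \sim \mathcal{N}(0, 1), \qquad R_t \mid A_t = a \;\sim\; \mathcal{N}\big(q_*(a), 1\big), \qquad a^* = \arg\max_a q_*(a)$$

Under this reading, $a^*$ is the index that get_optimal_bandit() would be expected to return, and "percent optimal action" is the fraction of steps on which $A_t = a^*$.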
- Parameters:
num_bandits (int) – Number of random bandits
References
[1] Sutton, R. S. and Barto, A. G.: Reinforcement Learning: An Introduction, 2nd edition, p. 29
Examples:
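The following is a minimal sketch of how an environment with this interface might be driven. It does not import prt_sim; instead it defines a hypothetical stand-in class, `SketchKArmBandits`, that mirrors the documented method names so the example runs on its own. The `seed` parameter, the single-state convention, and the reading of the returned tuple as (state, reward, done) are assumptions made for illustration and are not confirmed by the documentation above.

```python
import random
from typing import List, Tuple


class SketchKArmBandits:
    """Hypothetical stand-in that mirrors the documented method names."""

    def __init__(self, num_bandits: int = 10, seed: int = 0) -> None:
        # `seed` is an assumption; the documented constructor only takes num_bandits.
        self._rng = random.Random(seed)
        # True action values q_*(a), drawn from N(0, 1).
        self._q_star: List[float] = [
            self._rng.gauss(0.0, 1.0) for _ in range(num_bandits)
        ]

    def get_number_of_actions(self) -> int:
        return len(self._q_star)

    def get_number_of_states(self) -> int:
        # Assumption: a bandit problem has a single, unchanging state.
        return 1

    def get_optimal_bandit(self) -> int:
        # Index of the arm with the largest true value (evaluation only).
        return max(range(len(self._q_star)), key=lambda a: self._q_star[a])

    def execute_action(self, action: int) -> Tuple[int, float, bool]:
        # Assumption: the tuple is (state, reward, done).
        reward = self._rng.gauss(self._q_star[action], 1.0)
        return 0, reward, False


env = SketchKArmBandits(num_bandits=10)
num_actions = env.get_number_of_actions()

# A uniformly random agent, used only to exercise the interface.
agent_rng = random.Random(1)
num_steps = 1000
total_reward = 0.0
for _ in range(num_steps):
    action = agent_rng.randrange(num_actions)
    _state, reward, _done = env.execute_action(action)
    total_reward += reward

average_reward = total_reward / num_steps
print(f"average reward of a random agent: {average_reward:.3f}")
```

With the real class, the stand-in would presumably be replaced by `KArmBandits(num_bandits=10)`; whether the returned tuple has the layout assumed here should be checked against the source.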
- execute_action(action: int) → Tuple[int, float, bool]
Executes the action and advances the environment by one step.
- get_number_of_actions() → int
Returns the number of actions, which equals the number of bandits.
- Returns:
number of actions
- Return type:
int
- get_number_of_states() → int
Returns the number of states
- Returns:
number of states
- Return type:
int
- get_optimal_bandit() → int
Returns the index of the optimal bandit. This is intended for evaluation only; the agent should not use it.
- Returns:
optimal bandit index
- Return type:
int
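As a sketch of the evaluation use mentioned for get_optimal_bandit(), the snippet below measures the two metrics named in the class description (average reward and percent optimal action) for an epsilon-greedy agent with sample-average estimates. The agent, the function name `run_epsilon_greedy`, and all of its parameters are hypothetical and chosen for illustration. The environment is simulated inline rather than imported, with `optimal` playing the role of the value an evaluator would obtain from get_optimal_bandit().

```python
import random
from typing import Tuple


def run_epsilon_greedy(num_bandits: int = 10, num_steps: int = 1000,
                       epsilon: float = 0.1, seed: int = 0) -> Tuple[float, float]:
    """Return (average reward, fraction of steps on which the optimal arm was chosen)."""
    rng = random.Random(seed)

    # Inline stand-in for the environment: true values q_*(a) ~ N(0, 1).
    q_star = [rng.gauss(0.0, 1.0) for _ in range(num_bandits)]
    # What an evaluator would read from get_optimal_bandit().
    optimal = max(range(num_bandits), key=lambda a: q_star[a])

    estimates = [0.0] * num_bandits  # agent's sample-average value estimates
    counts = [0] * num_bandits       # number of pulls per arm
    total_reward = 0.0
    optimal_picks = 0

    for _ in range(num_steps):
        if rng.random() < epsilon:
            action = rng.randrange(num_bandits)  # explore
        else:
            action = max(range(num_bandits), key=lambda a: estimates[a])  # exploit

        # Reward drawn from N(q_*(action), 1).
        reward = rng.gauss(q_star[action], 1.0)

        # Incremental sample-average update.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

        total_reward += reward
        # Evaluation only: the agent's action choice never reads `optimal`.
        optimal_picks += int(action == optimal)

    return total_reward / num_steps, optimal_picks / num_steps


avg_reward, frac_optimal = run_epsilon_greedy()
print(f"average reward: {avg_reward:.3f}, optimal action: {100 * frac_optimal:.1f}%")
```

Keeping the optimal index out of the action-selection branch and touching it only in the bookkeeping line is one way to respect the "evaluation only" note above.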
- render()
Renders the environment