bandits

Classes

class prt_sim.jhu.bandits.KArmBandits(num_bandits: int = 10)

K-arm Bandits simulation

In the k-armed bandit problem, the true value $q_*(a)$ of each action is drawn from a normal distribution with mean zero and unit variance. The reward for playing action $a$ is then drawn from a normal distribution with mean $q_*(a)$ and unit variance. Sutton and Barto [1] track two metrics on this testbed: average reward versus steps and percentage of optimal actions versus steps.
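The reward model described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the code used by `KArmBandits`; the helper names `draw_true_values` and `draw_reward` are hypothetical.

```python
import random
from typing import List


def draw_true_values(num_bandits: int, rng: random.Random) -> List[float]:
    """Draw the true value q_*(a) of each arm from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(num_bandits)]


def draw_reward(q_star: List[float], action: int, rng: random.Random) -> float:
    """Draw one reward for `action` from N(q_*(action), 1)."""
    return rng.gauss(q_star[action], 1.0)


rng = random.Random(0)
q_star = draw_true_values(10, rng)

# The optimal arm is the one with the largest true value.
best_arm = max(range(len(q_star)), key=q_star.__getitem__)

# Averaging many rewards from one arm should approach its true value.
samples = [draw_reward(q_star, best_arm, rng) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
print(best_arm, q_star[best_arm], sample_mean)
```

Because rewards are noisy with unit variance, a single reward says little about an arm; only the average over many plays reveals $q_*(a)$.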

Parameters:

num_bandits (int) – Number of random bandits

References

[1] Sutton, R. S. and Barto, A. G.: Reinforcement Learning: An Introduction, 2nd edition, p. 29

Examples:
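A minimal sketch of an interaction loop against the interface documented on this page. So that it runs without the package installed, it uses `StandInBandits`, a hypothetical stand-in that mirrors the documented method names; its internals (a single dummy state of 0, `done` always False) are assumptions, not the behavior of `KArmBandits`.

```python
import random


class StandInBandits:
    """Hypothetical stand-in mirroring the documented KArmBandits interface.

    This is NOT the prt_sim implementation; it only lets the loop below run.
    """

    def __init__(self, num_bandits: int = 10):
        self.num_bandits = num_bandits
        self._rng = random.Random()
        self._q_star = [0.0] * num_bandits

    def reset(self, seed=None, randomize_start=False) -> int:
        self._rng = random.Random(seed)
        # True action values drawn from N(0, 1).
        self._q_star = [self._rng.gauss(0.0, 1.0) for _ in range(self.num_bandits)]
        return 0  # single dummy state (assumption)

    def execute_action(self, action: int):
        reward = self._rng.gauss(self._q_star[action], 1.0)
        return 0, reward, False  # (state, reward, done); state and done are assumptions

    def get_number_of_actions(self) -> int:
        return self.num_bandits


def average_reward_random_policy(env, steps: int = 1000, seed: int = 0) -> float:
    """Play uniformly random arms for `steps` steps and return the mean reward."""
    env.reset(seed=seed)
    policy_rng = random.Random(seed)
    num_actions = env.get_number_of_actions()
    total = 0.0
    for _ in range(steps):
        action = policy_rng.randrange(num_actions)
        _state, reward, _done = env.execute_action(action)  # only the reward is used
        total += reward
    return total / steps


print(average_reward_random_policy(StandInBandits(num_bandits=10)))
```

With the package installed, passing `KArmBandits(num_bandits=10)` in place of the stand-in should work the same way, provided the methods behave as documented here.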

execute_action(action: int) → Tuple[int, float, bool]

Executes the given action, advancing the environment by one step.

Parameters:

action (int) – Index of the bandit to play

Returns:

A (state, reward, done) tuple. Only the reward is relevant for this environment.

Return type:

tuple
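Since only the reward carries information, an agent typically unpacks the tuple, discards the state and done flag, and folds the reward into a running estimate for the arm it played. The sketch below shows the standard incremental sample-average update; `update_estimate` is a hypothetical helper, and the tuples are made-up values shaped like the documented return type.

```python
from typing import List


def update_estimate(estimates: List[float], counts: List[int],
                    action: int, reward: float) -> None:
    """Incremental sample average: Q <- Q + (R - Q) / N, updated in place."""
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]


estimates = [0.0, 0.0]
counts = [0, 0]

# Hypothetical (action, (state, reward, done)) pairs standing in for real calls
# such as `env.execute_action(action)`.
steps = [(0, (0, 1.0, False)), (0, (0, 3.0, False)), (1, (0, -2.0, False))]
for action, (_state, reward, _done) in steps:
    update_estimate(estimates, counts, action, reward)

print(estimates)  # [2.0, -2.0]
print(counts)     # [2, 1]
```

The incremental form avoids storing every past reward: each estimate equals the plain average of the rewards seen so far for that arm.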

get_number_of_actions() → int

Returns the number of actions, which is equal to the number of bandits.

Returns:

number of actions

Return type:

int

get_number_of_states() → int

Returns the number of states.

Returns:

number of states

Return type:

int

get_optimal_bandit() → int

Returns the optimal bandit. This should not be used by the agent, but only for evaluation purposes.

Returns:

optimal bandit index

Return type:

int
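For evaluation, the optimal bandit index is what makes the "percent optimal action" metric computable. A minimal sketch, assuming the index is read once per run (for example via `get_optimal_bandit()` after a reset) and that every run records the same number of actions; `percent_optimal_per_step` is a hypothetical helper.

```python
from typing import List, Sequence


def percent_optimal_per_step(actions_by_run: Sequence[Sequence[int]],
                             optimal_by_run: Sequence[int]) -> List[float]:
    """For each step, the percentage of runs that chose that run's optimal arm.

    Assumes at least one run and equal-length action sequences.
    """
    num_runs = len(actions_by_run)
    num_steps = len(actions_by_run[0])
    curve = []
    for t in range(num_steps):
        hits = sum(1 for run, best in zip(actions_by_run, optimal_by_run)
                   if run[t] == best)
        curve.append(100.0 * hits / num_runs)
    return curve


# Two made-up runs of three steps each; the optimal arms are 1 and 2.
actions_by_run = [[0, 1, 1], [2, 2, 0]]
optimal_by_run = [1, 2]
print(percent_optimal_per_step(actions_by_run, optimal_by_run))  # [50.0, 100.0, 50.0]
```

Keeping this computation outside the agent preserves the separation noted above: the agent never sees the optimal index, only the evaluation code does.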

render()

Renders the environment.

reset(seed: int | None = None, randomize_start: bool | None = False) → int

Resets the bandits' probabilities randomly or with provided values.

Parameters:
  • seed (int, optional) – Random seed. Defaults to None.

  • randomize_start (bool, optional) – Whether to randomize the starting state. Not all environments will support this. Defaults to False.

Returns:

current state value

Return type:

int