bandits#

Classes#

class prt_sim.jhu.bandits.KArmBandits(num_bandits: int = 10)

K-arm Bandits simulation

The k-armed bandit testbed draws the true value $q_*(a)$ of each action from a normal distribution with mean zero and unit variance. The reward for playing action $a$ is then drawn from a normal distribution with mean $q_*(a)$ and unit variance. Following Sutton and Barto [1], the metrics to track are average reward versus steps and percent optimal action versus steps.
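The reward model described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the class's actual implementation; the helper names `make_true_values` and `sample_reward` are hypothetical.

```python
import random


def make_true_values(num_bandits: int, rng: random.Random) -> list:
    """Draw one true action value q_*(a) per arm from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(num_bandits)]


def sample_reward(true_values: list, action: int, rng: random.Random) -> float:
    """Draw a reward for `action` from N(q_*(action), 1)."""
    return rng.gauss(true_values[action], 1.0)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_values = make_true_values(10, rng)

# The optimal action is the arm with the largest true value.
optimal_action = max(range(len(true_values)), key=true_values.__getitem__)

# A single reward is noisy, so it may fall below the arm's true value.
reward = sample_reward(true_values, optimal_action, rng)
```

Because every reward carries unit-variance noise, an agent needs repeated pulls of an arm before its sample mean is a reliable estimate of $q_*(a)$.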

Parameters:: num_bandits (int) – Number of bandits (arms). Defaults to 10.

References

[1] Sutton, R. S. and Barto, A. G.: Reinforcement Learning: An Introduction, 2nd edition, p. 29

Examples:
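The following sketch shows how an epsilon-greedy agent might drive an environment with the methods documented on this page. So that it runs without the package installed, it uses `StandInBandits`, a hypothetical stand-in that mirrors only the documented method names. Its internals, the constant state of 0, and the `done` value of False are assumptions, not the library's behavior. `run_epsilon_greedy` is likewise a hypothetical helper.

```python
import random
from typing import Tuple


class StandInBandits:
    """Hypothetical stand-in that mirrors the documented method names."""

    def __init__(self, num_bandits: int = 10) -> None:
        self.num_bandits = num_bandits
        self._rng = random.Random()
        self._true_values = [0.0] * num_bandits

    def reset(self, seed=None, randomize_start=False) -> int:
        # Redraw the true values q_*(a) from N(0, 1).
        self._rng = random.Random(seed)
        self._true_values = [self._rng.gauss(0.0, 1.0) for _ in range(self.num_bandits)]
        return 0  # assumption: a single dummy state

    def execute_action(self, action: int) -> Tuple[int, float, bool]:
        reward = self._rng.gauss(self._true_values[action], 1.0)
        return 0, reward, False  # assumption: state and done carry no information

    def get_number_of_actions(self) -> int:
        return self.num_bandits

    def get_optimal_bandit(self) -> int:
        return max(range(self.num_bandits), key=self._true_values.__getitem__)


def run_epsilon_greedy(env, steps: int = 1000, epsilon: float = 0.1, agent_seed: int = 1):
    """Play `steps` pulls, exploring with probability `epsilon`."""
    agent_rng = random.Random(agent_seed)
    env.reset(seed=0)
    num_actions = env.get_number_of_actions()
    estimates = [0.0] * num_actions  # sample-average value estimates
    counts = [0] * num_actions       # number of pulls per arm
    total_reward = 0.0
    for _ in range(steps):
        if agent_rng.random() < epsilon:
            action = agent_rng.randrange(num_actions)  # explore
        else:
            action = max(range(num_actions), key=estimates.__getitem__)  # exploit
        _, reward, _ = env.execute_action(action)  # only the reward is used
        counts[action] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward / steps


env = StandInBandits(num_bandits=10)
estimates, counts, average_reward = run_epsilon_greedy(env)
```

To try this against the real class, replacing `StandInBandits` with `KArmBandits` should be the only change needed, provided the documented signatures hold.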

execute_action(action: int) → Tuple[int, float, bool]

Executes the action and advances the environment by one step.

Parameters:: action (int) – Index of the bandit to play
Returns:: (state, reward, done); only the reward is meaningful for this environment
Return type:: tuple

get_number_of_actions() → int

Returns the number of actions, which is equal to the number of bandits.

Returns:: number of actions
Return type:: int

get_number_of_states() → int

Returns the number of states

Returns:: number of states
Return type:: int

get_optimal_bandit() → int

Returns the index of the optimal bandit. The agent should not use this value; it is intended for evaluation only.

Returns:: optimal bandit index
Return type:: int
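For evaluation, the optimal bandit index can be compared against the actions an agent took. The sketch below computes the two metrics named in the class description, average reward per step and percent optimal action per step, across several independent runs. The function names and the small hand-written logs are hypothetical; in practice each entry of `optimal_actions` would come from `get_optimal_bandit()` for that run.

```python
def average_reward_per_step(reward_runs):
    """Average the reward at each step across runs."""
    num_runs = len(reward_runs)
    return [sum(step_rewards) / num_runs for step_rewards in zip(*reward_runs)]


def percent_optimal_per_step(action_runs, optimal_actions):
    """Percentage of runs that chose the optimal arm at each step."""
    num_runs = len(action_runs)
    num_steps = len(action_runs[0])
    return [
        100.0
        * sum(run[t] == best for run, best in zip(action_runs, optimal_actions))
        / num_runs
        for t in range(num_steps)
    ]


# Two runs of three steps each (illustrative values only).
reward_runs = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
action_runs = [[0, 1, 1], [2, 2, 0]]
optimal_actions = [1, 2]  # optimal arm for run 0 and run 1

avg = average_reward_per_step(reward_runs)                    # [2.0, 1.0, 1.0]
pct = percent_optimal_per_step(action_runs, optimal_actions)  # [50.0, 100.0, 50.0]
```

Averaging over many runs, each with freshly drawn true values, smooths out the per-reward noise and makes the learning curves comparable between agents.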

render()

Renders the environment.

reset(seed: int | None = None, randomize_start: bool | None = False) → int

Resets the bandits' probabilities, either randomly or with provided values.

Parameters::
seed (int, optional) – Random seed. Defaults to None.
randomize_start (bool, optional) – Whether to randomize the starting state. Not all environments support this. Defaults to False.
Returns:: current state value
Return type:: int
