Multi-armed bandit
- Classic reinforcement learning scenario
- Illustrates the exploration-exploitation dilemma
- Imagine a row of slot machines (sometimes known as “one-armed bandits”)
- Each machine has different unknown probabilities of paying out
- Which machines should you play? How many times should you play each one? In what order? Should you stick with the current machine or try a different one? (a minimal simulation sketch follows this list)
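The sketch below is a minimal Python model of the setup, assuming Bernoulli (win/lose) payouts; the class name, variable names, and example payout rates are illustrative, not from any particular library.

```python
import random

class BernoulliBandit:
    """A row of slot machines; each arm pays out 1 with its own hidden probability."""

    def __init__(self, payout_probs):
        self.payout_probs = payout_probs   # unknown to the player

    def pull(self, arm):
        """Pull one lever and return 1 on a payout, 0 otherwise."""
        return 1 if random.random() < self.payout_probs[arm] else 0

# Example: three machines with different hidden payout rates
bandit = BernoulliBandit([0.2, 0.5, 0.7])
print(bandit.pull(1))   # prints 0 or 1
```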
Multi-armed bandit
- Arms – pulling different slot machine levers (actions)
- Reward – payoffs you receive after each pull
- Regret – the difference between your actual rewards and what you could have earned by always pulling the optimal arm (a sketch of measuring regret follows this list)
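A minimal sketch of how regret can be measured in simulation, reusing the BernoulliBandit above; the harness name `run_policy` and the `choose_arm` callable are hypothetical names introduced here for illustration.

```python
def run_policy(bandit, choose_arm, num_pulls=1000):
    """Play an arm-selection policy and report its cumulative regret
    versus always pulling the arm with the highest expected payout."""
    best_mean = max(bandit.payout_probs)    # expected reward of the optimal arm
    total_reward = 0
    history = []                            # (arm, reward) pairs seen so far
    for _ in range(num_pulls):
        arm = choose_arm(history)           # the policy picks an arm from past pulls
        reward = bandit.pull(arm)
        history.append((arm, reward))
        total_reward += reward
    # Regret: expected earnings of the optimal arm minus the policy's actual earnings
    return best_mean * num_pulls - total_reward
```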
Multi-armed bandit
- Only one state, so no state transitions
- Actions don’t affect the environment’s future states
- Focus is on the exploration-exploitation trade-off
Multi-armed bandit – strategies
- ε-greedy: With probability 1-ε choose the best-known arm; with probability ε choose a random arm
- Upper Confidence Bound (UCB): Select arms based on their potential upside, factoring in uncertainty
- Thompson Sampling: Choose arms proportionally to the probability they are optimal, using Bayesian updates
- Softmax: Select arms with probability proportional to their estimated values (minimal sketches of all four rules follow this list)
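The sketches below show one common form of each selection rule, assuming Bernoulli rewards and running-average value estimates (and Beta posteriors for Thompson Sampling); all function and parameter names are illustrative.

```python
import math
import random

def epsilon_greedy(values, epsilon=0.1):
    """With probability epsilon explore a random arm; otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def ucb1(values, counts, t):
    """Pick the arm with the highest optimistic estimate: mean plus an uncertainty bonus."""
    for arm, n in enumerate(counts):
        if n == 0:                        # play every arm once before using the bonus
            return arm
    return max(range(len(values)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

def thompson(successes, failures):
    """Sample a payout rate from each arm's Beta posterior and pull the best sample."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

def softmax(values, temperature=0.1):
    """Sample an arm with probability proportional to exp(value / temperature)."""
    weights = [math.exp(v / temperature) for v in values]
    return random.choices(range(len(values)), weights=weights)[0]

def update(values, counts, arm, reward):
    """Incrementally update the running-average reward estimate for the pulled arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```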
Multi-armed bandit – real-world applications
- A/B testing on websites
- Clinical trials for medical treatments
- Online advertising (which ad to show)
- Resource allocation problems
- Recommendation systems