Multi-armed bandit
- Classic reinforcement learning scenario
- Illustrates the exploration-exploitation dilemma
- Imagine a row of slot machines (sometimes known as “one-armed bandits”)
- Each machine has different unknown probabilities of paying out
- Which machines should you play? How many times should you play each one? In what order? Should you stick with the current machine or try a different one? (a minimal simulation sketch follows this list)
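The sketch below is a minimal Python model of the setup, assuming Bernoulli (win/lose) payouts; the class name, variable names, and example payout rates are illustrative, not from any particular library.

```python
import random

class BernoulliBandit:
    """A row of slot machines; each arm pays out 1 with its own hidden probability."""

    def __init__(self, payout_probs):
        self.payout_probs = payout_probs   # unknown to the player

    def pull(self, arm):
        """Pull one lever and return 1 on a payout, 0 otherwise."""
        return 1 if random.random() < self.payout_probs[arm] else 0

# Example: three machines with different hidden payout rates
bandit = BernoulliBandit([0.2, 0.5, 0.7])
print(bandit.pull(1))   # prints 0 or 1
```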
Multi-armed bandit
- Arms – pulling different slot machine levers (actions)
- Reward – payoffs you receive after each pull
- Regret – the difference between your actual rewards and what you could have earned by always pulling the optimal arm (a sketch of measuring regret follows this list)
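A minimal sketch of how regret can be measured in simulation, reusing the BernoulliBandit above; the harness name `run_policy` and the `choose_arm` callable are hypothetical names introduced here for illustration.

```python
def run_policy(bandit, choose_arm, num_pulls=1000):
    """Play an arm-selection policy and report its cumulative regret
    versus always pulling the arm with the highest expected payout."""
    best_mean = max(bandit.payout_probs)    # expected reward of the optimal arm
    total_reward = 0
    history = []                            # (arm, reward) pairs seen so far
    for _ in range(num_pulls):
        arm = choose_arm(history)           # the policy picks an arm from past pulls
        reward = bandit.pull(arm)
        history.append((arm, reward))
        total_reward += reward
    # Regret: expected earnings of the optimal arm minus the policy's actual earnings
    return best_mean * num_pulls - total_reward
```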
Multi-armed bandit
- Only one state, so no state transitions
- Actions don’t affect the environment’s future states
- Focus is on the exploration-exploitation trade-off
Multi-armed bandit – strategies
- ε-greedy: With probability 1-ε choose the best-known arm; with probability ε choose a random arm
- Upper Confidence Bound (UCB): Select arms based on their potential upside, factoring in uncertainty
- Thompson Sampling: Choose arms proportionally to the probability they are optimal, using Bayesian updates
- Softmax: Select arms with probability proportional to their estimated values (minimal sketches of all four rules follow this list)
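The sketches below show one common form of each selection rule, assuming Bernoulli rewards and running-average value estimates (and Beta posteriors for Thompson Sampling); all function and parameter names are illustrative.

```python
import math
import random

def epsilon_greedy(values, epsilon=0.1):
    """With probability epsilon explore a random arm; otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def ucb1(values, counts, t):
    """Pick the arm with the highest optimistic estimate: mean plus an uncertainty bonus."""
    for arm, n in enumerate(counts):
        if n == 0:                        # play every arm once before using the bonus
            return arm
    return max(range(len(values)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

def thompson(successes, failures):
    """Sample a payout rate from each arm's Beta posterior and pull the best sample."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

def softmax(values, temperature=0.1):
    """Sample an arm with probability proportional to exp(value / temperature)."""
    weights = [math.exp(v / temperature) for v in values]
    return random.choices(range(len(values)), weights=weights)[0]

def update(values, counts, arm, reward):
    """Incrementally update the running-average reward estimate for the pulled arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```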
Multi-armed bandit – real-world applications
- A/B testing on websites
- Clinical trials for medical treatments
- Online advertising (which ad to show)
- Resource allocation problems
- Recommendation systems