Multi-armed bandit

Multi-armed bandit

  • Classic reinforcement learning scenario
  • Illustrates the exploration-exploitation dilemma
  • Imagine a row of slot machines (sometimes known as “one-armed bandits”)
  • Each machine has a different, unknown probability of paying out
  • Which machines should you play, how many times, and in what order? Stick with the current machine or try a different one? (A minimal simulation sketch follows.)
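As a minimal sketch of this setup (the class name and payout probabilities below are illustrative, not from the slides), each machine can be modelled as a Bernoulli arm that pays 1 with a hidden probability:

```python
import random

class BernoulliBandit:
    """A row of slot machines; each arm pays 1 with a hidden probability."""

    def __init__(self, payout_probs):
        self.payout_probs = payout_probs  # unknown to the player

    def pull(self, arm):
        # Return 1 (win) or 0 (loss) for the chosen machine.
        return 1 if random.random() < self.payout_probs[arm] else 0

# Example: three machines with different hidden payout rates (illustrative values)
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(1)
```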

Multi-armed bandit

  • Arms – the slot machine levers you can pull (the actions)
  • Reward – the payoff you receive after each pull
  • Regret – the difference between the rewards you actually collected and what you could have earned by always pulling the optimal arm (formalized below)
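One standard way to formalize regret (the symbols below are not in the slides: μ_i is arm i's mean payout, μ* the best arm's mean, A_t the arm pulled at step t) is the expected cumulative regret after T pulls:

```latex
% Expected cumulative regret after T pulls:
% \mu_i  = mean payout of arm i (unknown to the player)
% \mu^*  = \max_i \mu_i, the best arm's mean payout
% A_t    = arm pulled at step t
R_T = T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{A_t}\right]
```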

Multi-armed bandit

  • Only one state, so no state transitions
  • Actions don’t affect the environment’s future states
  • Focus is on the exploration-exploitation trade-off

Multi-armed bandit – strategies

  • ε-greedy: Choose the best-known arm most of the time (probability 1−ε), but occasionally (probability ε) select a random arm (see the sketch after this list)
  • Upper Confidence Bound (UCB): Select arms based on their potential upside, factoring in uncertainty
  • Thompson Sampling: Choose arms proportionally to the probability they are optimal, using Bayesian updates
  • Softmax: Select arms with probability proportional to their estimated values
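A minimal sketch of how each rule might pick an arm, assuming per-arm value estimates `values`, pull counts `counts`, Bernoulli success/failure tallies, and a step counter `t` (all names are illustrative, not from the slides):

```python
import math
import random

def epsilon_greedy(values, epsilon=0.1):
    # With probability epsilon explore a random arm, otherwise exploit the best estimate.
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def ucb1(values, counts, t):
    # Add an exploration bonus that is large for rarely pulled (uncertain) arms.
    return max(
        range(len(values)),
        key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a])
        if counts[a] > 0 else float("inf"),
    )

def thompson_bernoulli(successes, failures):
    # Sample a plausible payout rate for each arm from its Beta posterior; pick the largest.
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

def softmax(values, temperature=0.1):
    # Choose arms with probability proportional to exp(value / temperature).
    prefs = [math.exp(v / temperature) for v in values]
    total = sum(prefs)
    return random.choices(range(len(values)), weights=[p / total for p in prefs])[0]
```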

Multi-armed bandit – real-world applications

  • A/B testing on websites
  • Clinical trials for medical treatments
  • Online advertising (which ad to show)
  • Resource allocation problems
  • Recommendation systems

Multi-armed bandit – implementation

Implementation in Python
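The slide does not include the code itself; a minimal ε-greedy implementation in Python, assuming Bernoulli payouts as sketched earlier (all names and numbers are illustrative), might look like the following:

```python
import random

def run_epsilon_greedy(payout_probs, epsilon=0.1, n_pulls=1000):
    """Play a Bernoulli bandit with ε-greedy; return total reward and regret."""
    n_arms = len(payout_probs)
    counts = [0] * n_arms      # how often each arm was pulled
    values = [0.0] * n_arms    # running average reward per arm
    total_reward = 0

    for _ in range(n_pulls):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if random.random() < epsilon:
            arm = random.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: values[a])

        reward = 1 if random.random() < payout_probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward

    # Regret relative to always pulling the best arm.
    regret = n_pulls * max(payout_probs) - total_reward
    return total_reward, regret

if __name__ == "__main__":
    reward, regret = run_epsilon_greedy([0.2, 0.5, 0.7])
    print(f"total reward: {reward}, regret vs. best arm: {regret}")
```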