Multi-armed Bandits (RL Intro Chapter 2)

By Tyler Gilman on Apr 7, 2025, citing Richard S. Sutton and Andrew G. Barto (Reinforcement Learning: An Introduction, Chapter 2)

Organized Notes on Multi-Armed Bandits and Reinforcement Learning

Core Concepts

Multi-Armed Bandit Problem

  • k levers, at each time step t = 1, 2, 3… choose to pull one lever A_t from set of k possible actions
  • Receive reward R_t after taking A_t
  • R_t is sampled IID from the payout distribution of arm A_t
  • Independence: each pull of a machine is independent of previous pulls on the same machine (see the sketch below)
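
A minimal sketch of this setup in Python (the 10-arm size and the Gaussian payout distributions are my assumptions, loosely in the style of the chapter's testbed):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                                # number of levers (assumed)
q_star = rng.normal(0.0, 1.0, k)      # true action values q_*(a), fixed for the run

def pull(a):
    """Reward R_t for arm a: drawn IID around q_star[a] (assumed unit-variance Gaussian)."""
    return rng.normal(q_star[a], 1.0)

# Each pull is independent of every previous pull on the same arm.
rewards = [pull(3) for _ in range(5)]
```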

Bandit vs. Reinforcement Learning

  • RL agent reasons about long-term consequences (bandit optimizes immediate return)
  • Bandit has no state (though contextual bandit version exists)
  • Contextual bandit is RL where each episode lasts a single timestep (only the immediate reward matters)

Q-values (Action Values)

  • True value of action a: q_*(a) - expected reward when taking action a
  • Estimated value: Q_t(a) - approximation based on experience
  • As t approaches infinity, Q_t(a) converges to q_*(a) (with Monte Carlo / sample-average estimation, provided every action keeps being sampled)

Exploration vs. Exploitation

  • Exploitation: choose actions with highest estimated value (greedy)
  • Exploration: choose other actions to improve estimates
  • Need to balance both for optimal performance

Action Selection Methods

Greedy Selection

  • Always choose action with highest estimated value
  • A_t = argmax_a Q_t(a)
  • Good for exploitation, poor for exploration

ε-Greedy

  • With probability 1-ε: take greedy action (exploit)
  • With probability ε: take action uniformly at random (explore)
  • Simple and effective approach
  • Performance depends on value of ε and time horizon
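
A short sketch of ε-greedy selection (the array Q of estimates and the name eps are mine; ties in the greedy step go to the lowest-indexed arm here):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, eps=0.1):
    """With probability eps pick an arm uniformly at random; otherwise pick the greedy arm."""
    if rng.random() < eps:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit

Q = np.zeros(10)                           # current estimates Q_t(a), assumed to start at 0
action = epsilon_greedy(Q, eps=0.1)
```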

Upper Confidence Bound (UCB)

  • Selects actions based on estimated value and uncertainty
  • Balances exploration and exploitation more effectively than ε-greedy
  • Less practical for general RL settings and nonstationary problems
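
A sketch of UCB selection matching formula (9) in the list below; treating untried arms (N_t(a) = 0) as maximally uncertain is the usual convention, not something stated here:

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ]; untried arms are chosen first."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

Q, N = np.zeros(10), np.zeros(10)
a = ucb_select(Q, N, t=1)
```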

Gradient Bandits

  • Learn numerical preferences H_t(a) for each action
  • Use softmax to convert preferences to probabilities
  • Update preferences using stochastic gradient ascent
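
A sketch of one gradient-bandit step using a running average of rewards as the baseline, following formulas (10)-(12) below (variable names and the 10-armed Gaussian testbed are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = rng.normal(0.0, 1.0, 10)                 # assumed testbed
pull = lambda a: rng.normal(q_star[a], 1.0)

def softmax(H):
    """Convert preferences H_t(a) into probabilities pi_t(a)."""
    z = np.exp(H - H.max())                       # shift by max for numerical stability
    return z / z.sum()

H = np.zeros(10)          # preferences, all equal at the start
r_bar, n, alpha = 0.0, 0, 0.1

for _ in range(1000):
    pi = softmax(H)
    a = rng.choice(len(H), p=pi)                  # sample A_t ~ pi_t
    r = pull(a)
    n += 1
    r_bar += (r - r_bar) / n                      # baseline: average reward so far
    one_hot = np.eye(len(H))[a]
    H += alpha * (r - r_bar) * (one_hot - pi)     # stochastic gradient ascent on E[R_t]
```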

Optimistic Initial Values

  • Set initial Q-values higher than expected rewards
  • Encourages exploration initially
  • Works well for stationary problems, less suitable for nonstationary ones
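
A brief sketch: starting the estimates well above any realistic reward (the value +5 mirrors the chapter's example; the constant step size is an assumption) makes every first outcome a disappointment, so even purely greedy selection ends up trying all the arms early on:

```python
import numpy as np

Q = np.full(10, 5.0)     # optimistic start: true rewards here are roughly in [-3, 3]
alpha = 0.1

def update(a, r):
    """Constant step-size update; one real reward pulls the optimistic estimate back down."""
    Q[a] += alpha * (r - Q[a])
```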

Learning Methods

Sample-Average Method

  • Q_t(a) is average of all rewards received when taking action a
  • Simple but requires unbounded memory
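
A literal sketch of the sample-average method: keep every reward per arm and average on demand, which is exactly where the unbounded memory comes from (defaulting to 0 for untried arms is my choice):

```python
from collections import defaultdict

history = defaultdict(list)          # arm -> every reward ever received for that arm

def record(a, r):
    history[a].append(r)

def Q(a):
    """Sample-average estimate Q_t(a); 0 if arm a has never been tried."""
    return sum(history[a]) / len(history[a]) if history[a] else 0.0
```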

Incremental Implementation

  • Update estimates incrementally to reduce memory and computation
  • General update rule: New Estimate ← Old Estimate + Step Size · [Target - Old Estimate]
  • The [Target - Old Estimate] term is the error in the current estimate
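
The same estimate kept incrementally, storing only a count and a running mean per arm (a sketch; the array names are mine):

```python
import numpy as np

k = 10
Q = np.zeros(k)    # running estimates Q_n
N = np.zeros(k)    # pull counts

def update(a, r):
    """New Estimate <- Old Estimate + Step Size * [Target - Old Estimate], with step size 1/n."""
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]        # (r - Q[a]) is the error term
```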

Constant Step-Size

  • For nonstationary problems, recent rewards should be weighted more heavily
  • Use constant step-size parameter α instead of 1/n
  • Creates an exponential recency-weighted average (recent rewards get the most weight)
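
For a nonstationary arm the only change is replacing 1/n with a fixed α (a sketch; α = 0.1 is an arbitrary choice):

```python
import numpy as np

alpha = 0.1          # a reward j steps in the past gets weight alpha * (1 - alpha)**j
Q = np.zeros(10)

def update_constant(a, r):
    """Constant step-size update: old rewards are discounted geometrically."""
    Q[a] += alpha * (r - Q[a])
```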

Associative Search (Contextual Bandits)

  • Intermediate between the k-armed bandit problem and full RL
  • Learn policy mapping situations to best actions
  • Actions affect immediate reward but not next situation

Mathematical Formulas

  1. Q-value (True action value):

    $$q_*(a)=E[R_t|A_t=a]$$

  2. Estimated Q-value (Sample average):

    $$Q_t(a) = \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}}$$

  3. Formal estimated Q-value:

    $$Q_t(a) = \frac{\sum^{t-1}_{i=1}R_i \cdot I_{A_i=a}}{\sum^{t-1}_{i=1}I_{A_i=a}}$$

  4. Incremental Q-value update:

    $$Q_{n+1}=Q_n + \frac{1}{n}[R_n - Q_n]$$

  5. General update rule:

    $$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}[\text{Target} - \text{OldEstimate}]$$

  6. Constant step-size update:

    $$Q_{n+1} = Q_n + \alpha[R_n-Q_n]$$

  7. Expanded constant step-size update:

    $$Q_{n+1} = (1-\alpha)^nQ_1 + \sum^{n}_{i=1}\alpha(1-\alpha)^{n-i}R_i$$
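
    Unrolling the update in (6) one step at a time shows where these exponential weights come from:

    $$\begin{aligned}
    Q_{n+1} &= Q_n + \alpha(R_n - Q_n) \\
    &= \alpha R_n + (1-\alpha)Q_n \\
    &= \alpha R_n + (1-\alpha)\left[\alpha R_{n-1} + (1-\alpha)Q_{n-1}\right] \\
    &= (1-\alpha)^nQ_1 + \sum^{n}_{i=1}\alpha(1-\alpha)^{n-i}R_i
    \end{aligned}$$

    The weights (1-α)^n and α(1-α)^{n-i} sum to 1, and rewards further in the past are discounted geometrically.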

  8. Convergence conditions:

    $$\sum^\infty_{n=1}\alpha_n(a)=\infty \text{ and } \sum^\infty_{n=1} \alpha^2_n(a) < \infty$$

  9. UCB action selection:

    $$A_t = \text{argmax}_a[Q_t(a)+c \sqrt{\frac{\ln t}{N_t(a)}}]$$

  10. Softmax action selection probability:

    $$Pr\{A_t=a\}=\frac{e^{H_t(a)}}{\sum^{k}_{b=1}e^{H_t(b)}} = \pi_t(a)$$

  11. Gradient bandit update (selected action):

    $$H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R}_t)(1 - \pi_t(A_t))$$
    where the baseline \bar{R}_t is the average of all rewards up through and including time t

  12. Gradient bandit update (non-selected actions):

    $$H_{t+1}(a) = H_t(a) - \alpha(R_t - \bar{R}_t)\pi_t(a) \quad \text{for all } a \neq A_t$$

  13. Gradient ascent update:

    $$H_{t+1}(a)=H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)}$$

  14. Expected reward:

    $$E[R_t] = \sum_{b}\pi_t(b)q_*(b)$$

  15. Partial derivative for gradient bandit:

    $$\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)(I_{a=b}-\pi_t(a))$$
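
    Plugging the partial derivative (15) into the gradient ascent update (13), with the sampled reward R_t standing in for the expectation and the running average of rewards as a baseline, gives the single update that (11) and (12) are the two cases of:

    $$H_{t+1}(a) = H_t(a) + \alpha(R_t - \bar{R}_t)(I_{a=A_t} - \pi_t(a)) \quad \text{for all } a$$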