Multi-armed Bandits (RL Intro Chapter 2)

By Tyler Gilman on Apr 7, 2025, citing Richard S. Sutton and Andrew G. Barto (Reinforcement Learning: An Introduction, Chapter 2)

Organized Notes on Multi-Armed Bandits and Reinforcement Learning

Core Concepts

Multi-Armed Bandit Problem

  • k levers, at each time step t = 1, 2, 3… choose to pull one lever A_t from set of k possible actions
  • Receive reward R_t after taking A_t
  • R_t is sampled IID from the payout distribution of arm A_t
  • Independence: each pull of a machine is independent of previous pulls on the same machine (see the sketch below)
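
A minimal sketch of this setup in Python (the 10-arm size and the Gaussian payout distributions are my assumptions, loosely in the style of the chapter's testbed):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                                # number of levers (assumed)
q_star = rng.normal(0.0, 1.0, k)      # true action values q_*(a), fixed for the run

def pull(a):
    """Reward R_t for arm a: drawn IID around q_star[a] (assumed unit-variance Gaussian)."""
    return rng.normal(q_star[a], 1.0)

# Each pull is independent of every previous pull on the same arm.
rewards = [pull(3) for _ in range(5)]
```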

Bandit vs. Reinforcement Learning

  • RL agent reasons about long-term consequences (bandit optimizes immediate return)
  • Bandit has no state (though contextual bandit version exists)
  • Contextual bandit is RL where each episode lasts a single timestep (only the immediate reward matters)

Q-values (Action Values)

  • True value of action a: q_*(a) - expected reward when taking action a
  • Estimated value: Q_t(a) - approximation based on experience
  • As t approaches infinity, Q_t(a) converges to q_*(a) (with Monte Carlo / sample-average estimation, provided every action keeps being sampled)

Exploration vs. Exploitation

  • Exploitation: choose actions with highest estimated value (greedy)
  • Exploration: choose other actions to improve estimates
  • Need to balance both for optimal performance

Action Selection Methods

Greedy Selection

  • Always choose action with highest estimated value
  • A_t = argmax_a Q_t(a)
  • Good for exploitation, poor for exploration

ε-Greedy

  • With probability 1-ε: take greedy action (exploit)
  • With probability ε: take action uniformly at random (explore)
  • Simple and effective approach
  • Performance depends on value of ε and time horizon
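
A short sketch of ε-greedy selection (the array Q of estimates and the name eps are mine; ties in the greedy step go to the lowest-indexed arm here):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, eps=0.1):
    """With probability eps pick an arm uniformly at random; otherwise pick the greedy arm."""
    if rng.random() < eps:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit

Q = np.zeros(10)                           # current estimates Q_t(a), assumed to start at 0
action = epsilon_greedy(Q, eps=0.1)
```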

Upper Confidence Bound (UCB)

  • Selects actions based on estimated value and uncertainty
  • Balances exploration and exploitation more effectively than ε-greedy
  • Less practical for general RL settings and nonstationary problems
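
A sketch of UCB selection matching formula (9) in the list below; treating untried arms (N_t(a) = 0) as maximally uncertain is the usual convention, not something stated here:

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ]; untried arms are chosen first."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

Q, N = np.zeros(10), np.zeros(10)
a = ucb_select(Q, N, t=1)
```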

Gradient Bandits

  • Learn numerical preferences H_t(a) for each action
  • Use softmax to convert preferences to probabilities
  • Update preferences using stochastic gradient ascent
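
A sketch of one gradient-bandit step using a running average of rewards as the baseline, following formulas (10)-(12) below (variable names and the 10-armed Gaussian testbed are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = rng.normal(0.0, 1.0, 10)                 # assumed testbed
pull = lambda a: rng.normal(q_star[a], 1.0)

def softmax(H):
    """Convert preferences H_t(a) into probabilities pi_t(a)."""
    z = np.exp(H - H.max())                       # shift by max for numerical stability
    return z / z.sum()

H = np.zeros(10)          # preferences, all equal at the start
r_bar, n, alpha = 0.0, 0, 0.1

for _ in range(1000):
    pi = softmax(H)
    a = rng.choice(len(H), p=pi)                  # sample A_t ~ pi_t
    r = pull(a)
    n += 1
    r_bar += (r - r_bar) / n                      # baseline: average reward so far
    one_hot = np.eye(len(H))[a]
    H += alpha * (r - r_bar) * (one_hot - pi)     # stochastic gradient ascent on E[R_t]
```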

Optimistic Initial Values

  • Set initial Q-values higher than expected rewards
  • Encourages exploration initially
  • Works well for stationary problems, less suitable for nonstationary ones
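
A brief sketch: starting the estimates well above any realistic reward (the value +5 mirrors the chapter's example; the constant step size is an assumption) makes every first outcome a disappointment, so even purely greedy selection ends up trying all the arms early on:

```python
import numpy as np

Q = np.full(10, 5.0)     # optimistic start: true rewards here are roughly in [-3, 3]
alpha = 0.1

def update(a, r):
    """Constant step-size update; one real reward pulls the optimistic estimate back down."""
    Q[a] += alpha * (r - Q[a])
```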

Learning Methods

Sample-Average Method

  • Q_t(a) is average of all rewards received when taking action a
  • Simple but requires unbounded memory
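
A literal sketch of the sample-average method: keep every reward per arm and average on demand, which is exactly where the unbounded memory comes from (defaulting to 0 for untried arms is my choice):

```python
from collections import defaultdict

history = defaultdict(list)          # arm -> every reward ever received for that arm

def record(a, r):
    history[a].append(r)

def Q(a):
    """Sample-average estimate Q_t(a); 0 if arm a has never been tried."""
    return sum(history[a]) / len(history[a]) if history[a] else 0.0
```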

Incremental Implementation

  • Update estimates incrementally to reduce memory and computation
  • General update rule: New Estimate ← Old Estimate + Step Size · [Target - Old Estimate]
  • The [Target - Old Estimate] term is the error in the current estimate
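
The same estimate kept incrementally, storing only a count and a running mean per arm (a sketch; the array names are mine):

```python
import numpy as np

k = 10
Q = np.zeros(k)    # running estimates Q_n
N = np.zeros(k)    # pull counts

def update(a, r):
    """New Estimate <- Old Estimate + Step Size * [Target - Old Estimate], with step size 1/n."""
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]        # (r - Q[a]) is the error term
```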

Constant Step-Size

  • For nonstationary problems, recent rewards should be weighted more heavily
  • Use constant step-size parameter α instead of 1/n
  • Creates an exponential recency-weighted average (recent rewards get the most weight)
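
For a nonstationary arm the only change is replacing 1/n with a fixed α (a sketch; α = 0.1 is an arbitrary choice):

```python
import numpy as np

alpha = 0.1          # a reward j steps in the past gets weight alpha * (1 - alpha)**j
Q = np.zeros(10)

def update_constant(a, r):
    """Constant step-size update: old rewards are discounted geometrically."""
    Q[a] += alpha * (r - Q[a])
```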

Associative Search (Contextual Bandits)

  • Intermediate between the k-armed bandit problem and full RL
  • Learn policy mapping situations to best actions
  • Actions affect immediate reward but not next situation

Mathematical Formulas

  1. Q-value (True action value):

    $$q_*(a)=E[R_t|A_t=a]$$

  2. Estimated Q-value (Sample average):

    $$Q_t(a) = \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}}$$

  3. Formal estimated Q-value:

    $$Q_t(a) = \frac{\sum^{t-1}_{i=1}R_i \cdot I_{A_i=a}}{\sum^{t-1}_{i=1}I_{A_i=a}}$$

  4. Incremental Q-value update:

    $$Q_{n+1}=Q_n + \frac{1}{n}[R_n - Q_n]$$

  5. General update rule:

    $$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}[\text{Target} - \text{OldEstimate}]$$

  6. Constant step-size update:

    $$Q_{n+1} = Q_n + \alpha[R_n-Q_n]$$

  7. Expanded constant step-size update:

    $$Q_{n+1} = (1-\alpha)^nQ_1 + \sum^{n}_{i=1}\alpha(1-\alpha)^{n-i}R_i$$
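
    Unrolling the update in (6) one step at a time shows where these exponential weights come from:

    $$\begin{aligned}
    Q_{n+1} &= Q_n + \alpha(R_n - Q_n) \\
    &= \alpha R_n + (1-\alpha)Q_n \\
    &= \alpha R_n + (1-\alpha)\left[\alpha R_{n-1} + (1-\alpha)Q_{n-1}\right] \\
    &= (1-\alpha)^nQ_1 + \sum^{n}_{i=1}\alpha(1-\alpha)^{n-i}R_i
    \end{aligned}$$

    The weights (1-α)^n and α(1-α)^{n-i} sum to 1, and rewards further in the past are discounted geometrically.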

  8. Convergence conditions:

    $$\sum^\infty_{n=1}\alpha_n(a)=\infty \text{ and } \sum^\infty_{n=1} \alpha^2_n(a) < \infty$$

  9. UCB action selection:

    $$A_t = \text{argmax}_a[Q_t(a)+c \sqrt{\frac{\ln t}{N_t(a)}}]$$

  10. Softmax action selection probability:

    $$Pr\{A_t=a\}=\frac{e^{H_t(a)}}{\sum^{k}_{b=1}e^{H_t(b)}} = \pi_t(a)$$

  11. Gradient bandit update (selected action):

    $$H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R}_t)(1 - \pi_t(A_t))$$
    where the baseline \bar{R}_t is the average of all rewards up through and including time t

  12. Gradient bandit update (non-selected actions):

    $$H_{t+1}(a) = H_t(a) - \alpha(R_t - \bar{R}_t)\pi_t(a) \quad \text{for all } a \neq A_t$$

  13. Gradient ascent update:

    $$H_{t+1}(a)=H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)}$$

  14. Expected reward:

    $$E[R_t] = \sum_{b}\pi_t(b)q_*(b)$$

  15. Partial derivative for gradient bandit:

    $$\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)(I_{a=b}-\pi_t(a))$$
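
    Plugging the partial derivative (15) into the gradient ascent update (13), with the sampled reward R_t standing in for the expectation and the running average of rewards as a baseline, gives the single update that (11) and (12) are the two cases of:

    $$H_{t+1}(a) = H_t(a) + \alpha(R_t - \bar{R}_t)(I_{a=A_t} - \pi_t(a)) \quad \text{for all } a$$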