Multi-armed Bandits (RL Intro Chapter 2)
Organized Notes on Multi-Armed Bandits and Reinforcement Learning
Core Concepts
Multi-Armed Bandit Problem
- k levers; at each time step t = 1, 2, 3, … the agent chooses one lever to pull, selecting action A_t from the set of k possible actions
- Receive reward R_t after taking A_t
- R_t is sampled IID from payout distribution of arm A_t
- Independence: each pull of a machine is independent of all previous pulls of that same machine
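As a concrete illustration, here is a minimal sketch of a stationary k-armed bandit environment (the class name GaussianBandit and the Gaussian payout distributions are assumptions for the example, not specified in the notes):

```python
import numpy as np

class GaussianBandit:
    """Stationary k-armed bandit: each arm pays IID Gaussian rewards."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # True action values q_*(a), one per arm
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, a):
        """Sample a reward R_t for arm a, independent of previous pulls."""
        return self.rng.normal(self.q_star[a], 1.0)

bandit = GaussianBandit(k=10)
print(bandit.pull(3))  # one reward draw from arm 3
```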
Bandit vs. Reinforcement Learning
- RL agent reasons about long-term consequences (bandit optimizes immediate return)
- Bandit has no state (though contextual bandit version exists)
- A contextual bandit is RL with a single-timestep horizon: only the immediate reward matters
Q-values (Action Values)
- True value of action a: q_*(a) - expected reward when taking action a
- Estimated value: Q_t(a) - approximation based on experience
- As t approaches infinity, Q_t(a) converges to q_*(a) (with Monte Carlo algorithms)
Exploration vs. Exploitation
- Exploitation: choose actions with highest estimated value (greedy)
- Exploration: choose other actions to improve estimates
- Need to balance both for optimal performance
Action Selection Methods
Greedy Selection
- Always choose action with highest estimated value
- A_t = argmax_a Q_t(a)
- Good for exploitation, poor for exploration
ε-Greedy
- With probability 1-ε: take greedy action (exploit)
- With probability ε: take action uniformly at random (explore)
- Simple and effective approach
- Performance depends on value of ε and time horizon
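A minimal sketch of ε-greedy selection, assuming the estimates Q_t(a) are kept in a NumPy array `Q` (names here are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """With probability epsilon explore uniformly at random; otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: A_t = argmax_a Q_t(a)

rng = np.random.default_rng(0)
Q = np.zeros(10)
action = epsilon_greedy(Q, epsilon=0.1, rng=rng)
```

Note that np.argmax breaks ties toward the lowest index; a common refinement is to break ties among maximal actions at random.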
Upper Confidence Bound (UCB)
- Selects actions based on estimated value and uncertainty
- Balances exploration and exploitation more effectively than ε-greedy
- Less practical for general RL settings and nonstationary problems
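A sketch of UCB selection matching the formula in the Mathematical Formulas section below; `N` holds the visit counts N_t(a), and actions never tried are treated as maximizing (a standard implementation choice assumed here):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    untried = np.where(N == 0)[0]
    if untried.size > 0:
        return int(untried[0])             # untried arms have unbounded uncertainty
    bonus = c * np.sqrt(np.log(t) / N)     # exploration bonus shrinks as N_t(a) grows
    return int(np.argmax(Q + bonus))
```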
Gradient Bandits
- Learn numerical preferences H_t(a) for each action
- Use softmax to convert preferences to probabilities
- Update preferences using stochastic gradient ascent
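A sketch of a single gradient-bandit step under these assumptions: the preferences live in an array `H`, `avg_reward` is the running baseline R̄_t, and `reward_fn` stands in for the bandit; variable names are illustrative:

```python
import numpy as np

def gradient_bandit_step(H, avg_reward, reward_fn, alpha, rng):
    """Sample from the softmax policy, then do one stochastic gradient ascent update."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()                          # softmax: pi_t(a)
    a = rng.choice(len(H), p=pi)            # A_t ~ pi_t
    r = reward_fn(a)                        # observe R_t
    one_hot = np.zeros(len(H))
    one_hot[a] = 1.0
    # H_{t+1}(a) = H_t(a) + alpha * (R_t - baseline) * (1[a = A_t] - pi_t(a))
    H += alpha * (r - avg_reward) * (one_hot - pi)
    return a, r
```

The single vectorized update covers both cases in the formulas below: the selected action's preference rises when R_t beats the baseline, while all other preferences fall.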
Optimistic Initial Values
- Set initial Q-values higher than expected rewards
- Encourages exploration initially
- Works well for stationary problems, less suitable for nonstationary ones
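Optimistic initialization needs no special selection rule; purely greedy selection on inflated initial estimates already explores, as in this sketch (the value 5.0 is an illustrative choice for rewards roughly in the [-1, 1] range):

```python
import numpy as np

k = 10
Q = np.full(k, 5.0)   # initial estimates set well above any plausible reward
N = np.zeros(k)       # visit counts for the sample-average update
# Every arm "disappoints" relative to 5.0, so a greedy agent keeps switching
# arms until the optimism is washed out of the estimates.
```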
Learning Methods
Sample-Average Method
- Q_t(a) is average of all rewards received when taking action a
- Simple but requires unbounded memory
Incremental Implementation
- Update estimates incrementally to reduce memory and computation
- General update rule: NewEstimate ← OldEstimate + StepSize · [Target - OldEstimate]
- The bracketed term [Target - OldEstimate] is the error in the estimate
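A sketch of the incremental update; each new reward nudges the estimate toward the target by a step of size 1/n, so the full reward history never needs to be stored:

```python
def incremental_update(q_old, reward, n):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), with StepSize = 1/n."""
    return q_old + (reward - q_old) / n

# Running sample average of three rewards without storing them:
q, n = 0.0, 0
for r in [1.0, 3.0, 2.0]:
    n += 1
    q = incremental_update(q, r, n)
print(q)  # 2.0, the sample average
```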
Constant Step-Size
- For nonstationary problems, recent rewards should be weighted more heavily
- Use constant step-size parameter α instead of 1/n
- Creates an exponential recency-weighted average
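The same update with a constant step size α gives the exponential recency-weighted average from the formulas below; a quick sketch of how a late shift in payouts moves the estimate:

```python
def constant_alpha_update(q_old, reward, alpha=0.1):
    """Q_{n+1} = Q_n + alpha * (R_n - Q_n): recent rewards get the most weight."""
    return q_old + alpha * (reward - q_old)

q = 0.0
for r in [1.0, 1.0, 1.0, 5.0]:   # payout distribution shifts on the last pull
    q = constant_alpha_update(q, r)
print(q)  # ~0.74: the newest reward carries weight alpha, older ones decay by (1 - alpha)
```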
Associative Search (Contextual Bandits)
- Intermediate between the k-armed bandit and the full RL problem
- Learn policy mapping situations to best actions
- Actions affect immediate reward but not next situation
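A sketch of associative search with one ε-greedy learner per context: a separate row of estimates for each situation, with updates applied only to the row for the current situation (the class name and layout are assumptions for the example):

```python
import numpy as np

class ContextualEpsilonGreedy:
    """Per-context Q tables; the action affects the reward but not the next situation."""

    def __init__(self, n_contexts, k, epsilon=0.1, seed=0):
        self.Q = np.zeros((n_contexts, k))   # Q[s, a]: estimate for action a in situation s
        self.N = np.zeros((n_contexts, k))   # visit counts per (situation, action)
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)

    def act(self, context):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[context]))

    def update(self, context, action, reward):
        self.N[context, action] += 1
        self.Q[context, action] += (reward - self.Q[context, action]) / self.N[context, action]
```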
Mathematical Formulas
Q-value (True action value):
$$q_*(a)=E[R_t|A_t=a]$$
Estimated Q-value (Sample average):
$$Q_t(a) = \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}}$$
Formal estimated Q-value:
$$Q_t(a) = \frac{\sum^{t-1}_{i=1}R_i \cdot I_{A_i=a}}{\sum^{t-1}_{i=1}I_{A_i=a}}$$
Incremental Q-value update:
$$Q_{n+1}=Q_n + \frac{1}{n}[R_n - Q_n]$$
General update rule:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}[\text{Target} - \text{OldEstimate}]$$
Constant step-size update:
$$Q_{n+1} = Q_n + \alpha[R_n-Q_n]$$
Expanded constant step-size update:
$$Q_{n+1} = (1-\alpha)^nQ_1 + \sum^{n}_{i=1}\alpha(1-\alpha)^{n-i}R_i$$
Convergence conditions:
$$\sum^\infty_{n=1}\alpha_n(a)=\infty \quad \text{and} \quad \sum^\infty_{n=1} \alpha^2_n(a) < \infty$$
UCB action selection:
$$A_t = \text{argmax}_a\left[Q_t(a)+c \sqrt{\frac{\ln t}{N_t(a)}}\right]$$
Softmax action selection probability:
$$\Pr\{A_t=a\}=\frac{e^{H_t(a)}}{\sum^{k}_{b=1}e^{H_t(b)}} = \pi_t(a)$$
Gradient bandit update (selected action):
$$H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R}_t)(1 - \pi_t(A_t))$$
Gradient bandit update (non-selected actions):
$$H_{t+1}(a) = H_t(a) - \alpha(R_t - \bar{R}_t)\pi_t(a) \quad \text{for all } a \neq A_t$$
where $\bar{R}_t$ is the average of the rewards up to time t, used as a baseline.
Gradient ascent update:
$$H_{t+1}(a)=H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)}$$
Expected reward:
$$E[R_t] = \sum_b \pi_t(b)\,q_*(b)$$
Partial derivative for gradient bandit:
$$\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)(I_{a=b}-\pi_t(a))$$