Reinforcement Learning: An Introduction - Formulas

By Tyler Gilman, Apr 7, 2025. Formulas from Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction.

Multi-armed Bandits (Chapter 2)

  • (2.1) True action value:

    $$q_*(a) = E[R_t | A_t = a]$$

  • (2.2) Estimated Q-value (Sample average):

    $$Q_t(a) = \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}}$$

  • (2.3) Formal estimated Q-value:

    $$Q_t(a) = \frac{\sum^{t-1}_{i=1} R_i \cdot I_{A_i = a}}{\sum^{t-1}_{i=1} I_{A_i = a}}$$

  • (2.4) Incremental Q-value update:

    $$Q_{n+1} = Q_n + \frac{1}{n} [R_n - Q_n]$$

  • (2.5) General update rule:

    $$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} [\text{Target} - \text{OldEstimate}]$$

  • (2.6) Constant step-size update:

    $$Q_{n+1} = Q_n + \alpha [R_n - Q_n]$$

  • (2.7) Expanded constant step-size update:

    $$Q_{n+1} = (1 - \alpha)^n Q_1 + \sum^{n}_{i=1} \alpha (1 - \alpha)^{n-i} R_i$$

  • (2.8) Convergence conditions:

    $$\sum^\infty_{n=1} \alpha_n(a) = \infty \quad \text{and} \quad \sum^\infty_{n=1} \alpha^2_n(a) < \infty$$

  • (2.9) UCB action selection:

    $$A_t = \text{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

  • (2.10) Softmax action selection probability:

    $$Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum^{n}_{b=1} e^{H_t(b)}} = \pi_t(a)$$

  • (2.11) Gradient bandit update (selected action):

    $$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \overline{R}_t)(1 - \pi_t(A_t))$$

  • (2.12) Gradient bandit update (non-selected actions):

    $$H_{t+1}(a) = H_t(a) - \alpha (R_t - \overline{R}_t) \pi_t(a), \quad \forall a \neq A_t$$

  • (2.13) Gradient ascent update:

    $$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)}$$

  • (2.14) Expected reward:

    $$E[R_t] = \sum_x \pi_t(x) q_*(x)$$

  • (2.15) Partial derivative for gradient bandit:

    $$\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b) (I_{a=b} - \pi_t(a))$$
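
For reference, the estimation and selection rules above combine into a complete bandit agent: estimate each Q_t(a) with the incremental update (2.4) and choose actions with UCB (2.9). A minimal sketch follows, assuming Gaussian rewards; the reward model, `c = 2.0`, and the step count are illustrative choices, not from the text.

```python
import math
import random

def ucb_bandit(true_values, steps=1000, c=2.0, seed=0):
    """Sample-average estimates via (2.4) with UCB action selection (2.9).

    true_values[a] is the (assumed) mean reward of arm a; rewards are drawn
    from a unit-variance Gaussian around that mean -- an illustrative model.
    """
    rng = random.Random(seed)
    k = len(true_values)
    Q = [0.0] * k   # Q_t(a): estimated action values
    N = [0] * k     # N_t(a): number of times each action has been taken

    for t in range(1, steps + 1):
        untried = [a for a in range(k) if N[a] == 0]
        if untried:
            # An action with N_t(a) = 0 is treated as maximizing.
            a = untried[0]
        else:
            a = max(range(k), key=lambda i: Q[i] + c * math.sqrt(math.log(t) / N[i]))
        r = rng.gauss(true_values[a], 1.0)   # sample reward R_n
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental update (2.4)
    return Q, N

if __name__ == "__main__":
    Q, N = ucb_bandit([0.2, 0.8, 0.5])
    print("estimates:", [round(q, 2) for q in Q], "counts:", N)
```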

Markov Decision Processes (Chapter 3)

  • (3.2) MDP Dynamics:

    $$p(s', r | s, a) = Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}$$

  • (3.4) State Transition Probabilities:

    $$p(s' | s, a) = \sum_{r \in R} p(s', r | s, a)$$

  • (3.5) Expected Rewards for State-Action Pairs:

    $$r(s, a) = \sum_{r \in R} \sum_{s' \in S} r \cdot p(s', r | s, a)$$

  • (3.6) Expected Rewards for State-Action-Next-State Triples:

    $$r(s, a, s') = E[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in R} r \cdot \frac{p(s', r | s, a)}{p(s' | s, a)}$$

  • (3.7) Episodic Return:

    $$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$

  • (3.8) Discounted Return:

    $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • (3.9) Recursive Return Relationship:

    $$G_t = R_{t+1} + \gamma G_{t+1}$$

  • (3.12) State-Value Function:

    $$v_\pi(s) = E_\pi[G_t | S_t = s] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s \right]$$

  • (3.13) Action-Value Function:

    $$q_\pi(s, a) = E_\pi[G_t | S_t = s, A_t = a] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s, A_t = a \right]$$

  • (3.14) Bellman Equation for State Values:

    $$v_\pi(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

  • (3.15) Optimal State-Value Function:

    $$v_*(s) = \max_\pi v_\pi(s)$$

  • (3.16) Optimal Action-Value Function:

    $$q_*(s, a) = \max_\pi q_\pi(s, a)$$

  • (3.17) Relationship Between (v*) and (q*):

    $$q_*(s, a) = E[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]$$

  • (3.19) Bellman Optimality Equation for State Values:

    $$v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]$$

  • (3.20) Bellman Optimality Equation for Action Values:

    $$q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]$$
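
The return definitions (3.7)–(3.9) translate directly into code. A short sketch, with a made-up reward sequence and gamma = 0.9, computing G_t both directly from (3.8) and in a single backward pass via the recursion (3.9):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...  (3.8), computed directly."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def returns_backward(rewards, gamma=0.9):
    """All returns G_0..G_{T-1} in one backward pass using G_t = R_{t+1} + gamma*G_{t+1} (3.9)."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 5.0]        # R_1..R_T of a short, hypothetical episode
    print(discounted_return(rewards))     # G_0 by the direct sum
    print(returns_backward(rewards))      # [G_0, G_1, ..., G_{T-1}] by the recursion
```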

Dynamic Programming (Chapter 4)

  • (4.1) Bellman Optimality Equation for State Values:

    $$v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]$$

  • (4.2) Bellman Optimality Equation for Action Values:

    $$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a] = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]$$

  • (4.3, 4.4) Policy Evaluation Bellman Equation:

    $$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

  • (4.5) Iterative Policy Evaluation Update:

    $$v_{k+1}(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$$

  • (4.6) Action-Value for Policy Evaluation:

    $$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a] = \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

  • (4.7) Policy Improvement Condition:

    $$q_\pi(s, \pi'(s)) \geq v_\pi(s), \quad \forall s \in S$$

  • (4.9) Greedy Policy Definition:

    $$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

  • (4.10) Value Iteration Update:

    $$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$$
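
The value-iteration update (4.10) and the greedy-policy extraction (4.9) are easy to run on a small tabular MDP. The sketch below assumes a hypothetical two-state, two-action MDP whose dynamics `p[s][a]` (lists of probability, next-state, reward triples) are invented purely for illustration.

```python
# Hypothetical dynamics: p[s][a] = list of (probability, next_state, reward) triples.
p = {
    0: {0: [(1.0, 0, 0.0)],                   # stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},   # usually move to state 1, reward 1
    1: {0: [(1.0, 1, 0.0)],
        1: [(1.0, 0, 2.0)]},                  # return to state 0, reward 2
}

def value_iteration(p, gamma=0.9, theta=1e-8):
    """Sweep v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a)[r + gamma*v_k(s')]  (4.10) to convergence."""
    v = {s: 0.0 for s in p}
    while True:
        delta = 0.0
        for s in p:
            new_v = max(
                sum(prob * (r + gamma * v[s2]) for prob, s2, r in outcomes)
                for outcomes in p[s].values()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def greedy_policy(p, v, gamma=0.9):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a)[r + gamma*v(s')]  (4.9)."""
    return {s: max(p[s], key=lambda a: sum(prob * (r + gamma * v[s2])
                                           for prob, s2, r in p[s][a]))
            for s in p}

if __name__ == "__main__":
    v = value_iteration(p)
    print(v, greedy_policy(p, v))
```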

Monte Carlo Methods (Chapter 5)

  • (5.1) Greedy Policy from Action-Values:

    $$\pi(s) = \arg\max_a q(s, a)$$

  • (5.2) ε-greedy Policy Improvement:

    $$q_\pi(s, \pi'(s)) = \frac{\epsilon}{|A(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \max_a q_\pi(s, a)$$

  • (5.3) Importance Sampling Ratio:

    $$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k | S_k)}{b(A_k | S_k)}$$

  • (5.4) Expected Value under Target Policy:

    $$\mathbb{E}[\rho_{t:T-1} G_t | S_t = s] = v_\pi(s)$$

  • (5.5) Ordinary Importance Sampling Estimator:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$

  • (5.6) Weighted Importance Sampling Estimator:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$

  • (5.7, 5.8) Weighted Average Update Rule:

    $$V_{n+1} = V_n + \frac{W_n}{C_n} [G_n - V_n], \quad n \geq 1, \qquad C_{n+1} = C_n + W_{n+1}, \quad C_0 = 0$$

  • (5.9) Discounting-aware Importance Sampling, Ordinary:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \left[ (1 - \gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \overline{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \overline{G}_{t:T(t)} \right]}{|\mathcal{T}(s)|}$$

    where $\overline{G}_{t:h} = R_{t+1} + R_{t+2} + \dots + R_h$ is the flat (undiscounted) partial return.

  • (5.10) Discounting-aware Importance Sampling, Weighted:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \left[ (1 - \gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \overline{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \overline{G}_{t:T(t)} \right]}{\sum_{t \in \mathcal{T}(s)} \left[ (1 - \gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \right]}$$

  • (5.11) Return Decomposition:

    $$\rho_{t:T-1} G_t = \frac{\pi(A_t | S_t)}{b(A_t | S_t)} \frac{\pi(A_{t+1} | S_{t+1})}{b(A_{t+1} | S_{t+1})} \cdots \frac{\pi(A_{T-1} | S_{T-1})}{b(A_{T-1} | S_{T-1})} \left( R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T \right)$$

  • (5.12-5.15) Per-decision Importance Sampling:

    $$\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$

    $$\mathbb{E}[\rho_{t:T-1} G_t] = \mathbb{E}[\tilde{G}_t]$$

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}$$
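
The two off-policy estimators (5.5) and (5.6) differ only in their denominators, which is easiest to see in code. A small sketch, assuming the per-episode ratios rho_{t:T(t)-1} and returns G_t for visits to a state s have already been computed; the sample numbers are fabricated for illustration.

```python
def ordinary_is(rhos, returns):
    """Ordinary importance sampling (5.5): average rho_t * G_t over the |T(s)| visits."""
    return sum(rho * g for rho, g in zip(rhos, returns)) / len(returns)

def weighted_is(rhos, returns):
    """Weighted importance sampling (5.6): ratio-weighted average of the returns."""
    denom = sum(rhos)
    return sum(rho * g for rho, g in zip(rhos, returns)) / denom if denom else 0.0

if __name__ == "__main__":
    rhos    = [0.5, 2.0, 1.0, 0.0]   # rho_{t:T(t)-1}, one per visit to s
    returns = [1.0, 3.0, 2.0, 5.0]   # corresponding returns G_t
    print(ordinary_is(rhos, returns))   # unbiased, but higher variance
    print(weighted_is(rhos, returns))   # biased (consistent), lower variance
```

Because the weighted estimator is a weighted average of observed returns, it is bounded by the largest return seen, which is one reason the text prefers it in practice despite its bias.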