Reinforcement Learning: An Introduction - Formulas

By Tyler Gilman, Apr 7, 2025. Formulas from Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction.

Multi-armed Bandits (Chapter 2)

  • (2.1) True action value:

    $$q_*(a) = E[R_t | A_t = a]$$

  • (2.2) Estimated Q-value (Sample average):

    $$Q_t(a) = \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}}$$

  • (2.3) Formal estimated Q-value:

    $$Q_t(a) = \frac{\sum^{t-1}_{i=1} R_i \cdot I_{A_i = a}}{\sum^{t-1}_{i=1} I_{A_i = a}}$$

  • (2.4) Incremental Q-value update:

    $$Q_{n+1} = Q_n + \frac{1}{n} [R_n - Q_n]$$

  • (2.5) General update rule:

    $$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} [\text{Target} - \text{OldEstimate}]$$

  • (2.6) Constant step-size update:

    $$Q_{n+1} = Q_n + \alpha [R_n - Q_n]$$

  • (2.7) Expanded constant step-size update:

    $$Q_{n+1} = (1 - \alpha)^n Q_1 + \sum^{n}_{i=1} \alpha (1 - \alpha)^{n-i} R_i$$

  • (2.8) Convergence conditions:

    $$\sum^\infty_{n=1} \alpha_n(a) = \infty \quad \text{and} \quad \sum^\infty_{n=1} \alpha^2_n(a) < \infty$$

  • (2.9) UCB action selection:

    $$A_t = \text{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

  • (2.10) Softmax action selection probability:

    $$Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum^{n}_{b=1} e^{H_t(b)}} = \pi_t(a)$$

  • (2.11) Gradient bandit update (selected action):

    $$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \overline{R}_t)(1 - \pi_t(A_t))$$

  • (2.12) Gradient bandit update (non-selected actions):

    $$H_{t+1}(a) = H_t(a) - \alpha (R_t - \overline{R}_t) \pi_t(a), \quad \forall a \neq A_t$$

  • (2.13) Gradient ascent update:

    $$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)}$$

  • (2.14) Expected reward:

    $$E[R_t] = \sum_x \pi_t(x) q_*(x)$$

  • (2.15) Partial derivative for gradient bandit:

    $$\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b) (I_{a=b} - \pi_t(a))$$
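
For reference, the estimation and selection rules above combine into a complete bandit agent: estimate each Q_t(a) with the incremental update (2.4) and choose actions with UCB (2.9). A minimal sketch follows, assuming Gaussian rewards; the reward model, `c = 2.0`, and the step count are illustrative choices, not from the text.

```python
import math
import random

def ucb_bandit(true_values, steps=1000, c=2.0, seed=0):
    """Sample-average estimates via (2.4) with UCB action selection (2.9).

    true_values[a] is the (assumed) mean reward of arm a; rewards are drawn
    from a unit-variance Gaussian around that mean -- an illustrative model.
    """
    rng = random.Random(seed)
    k = len(true_values)
    Q = [0.0] * k   # Q_t(a): estimated action values
    N = [0] * k     # N_t(a): number of times each action has been taken

    for t in range(1, steps + 1):
        untried = [a for a in range(k) if N[a] == 0]
        if untried:
            # An action with N_t(a) = 0 is treated as maximizing.
            a = untried[0]
        else:
            a = max(range(k), key=lambda i: Q[i] + c * math.sqrt(math.log(t) / N[i]))
        r = rng.gauss(true_values[a], 1.0)   # sample reward R_n
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental update (2.4)
    return Q, N

if __name__ == "__main__":
    Q, N = ucb_bandit([0.2, 0.8, 0.5])
    print("estimates:", [round(q, 2) for q in Q], "counts:", N)
```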

Markov Decision Processes (Chapter 3)

  • (3.2) MDP Dynamics:

    $$p(s', r | s, a) = Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}$$

  • (3.4) State Transition Probabilities:

    $$p(s' | s, a) = \sum_{r \in R} p(s', r | s, a)$$

  • (3.5) Expected Rewards for State-Action Pairs:

    $$r(s, a) = \sum_{r \in R} \sum_{s' \in S} r \cdot p(s', r | s, a)$$

  • (3.6) Expected Rewards for State-Action-Next-State Triples:

    $$r(s, a, s') = E[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in R} r \cdot \frac{p(s', r | s, a)}{p(s' | s, a)}$$

  • (3.7) Episodic Return:

    $$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$

  • (3.8) Discounted Return:

    $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

  • (3.9) Recursive Return Relationship:

    $$G_t = R_{t+1} + \gamma G_{t+1}$$

  • (3.12) State-Value Function:

    $$v_\pi(s) = E_\pi[G_t | S_t = s] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s \right]$$

  • (3.13) Action-Value Function:

    $$q_\pi(s, a) = E_\pi[G_t | S_t = s, A_t = a] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s, A_t = a \right]$$

  • (3.14) Bellman Equation for State Values:

    $$v_\pi(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

  • (3.15) Optimal State-Value Function:

    $$v_*(s) = \max_\pi v_\pi(s)$$

  • (3.16) Optimal Action-Value Function:

    $$q_*(s, a) = \max_\pi q_\pi(s, a)$$

  • (3.17) Relationship Between (v*) and (q*):

    $$q_*(s, a) = E[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]$$

  • (3.19) Bellman Optimality Equation for State Values:

    $$v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]$$

  • (3.20) Bellman Optimality Equation for Action Values:

    $$q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]$$
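
The return definitions (3.7)–(3.9) translate directly into code. A short sketch, with a made-up reward sequence and gamma = 0.9, computing G_t both directly from (3.8) and in a single backward pass via the recursion (3.9):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...  (3.8), computed directly."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def returns_backward(rewards, gamma=0.9):
    """All returns G_0..G_{T-1} in one backward pass using G_t = R_{t+1} + gamma*G_{t+1} (3.9)."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 5.0]        # R_1..R_T of a short, hypothetical episode
    print(discounted_return(rewards))     # G_0 by the direct sum
    print(returns_backward(rewards))      # [G_0, G_1, ..., G_{T-1}] by the recursion
```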

Dynamic Programming (Chapter 4)

  • (4.1) Bellman Optimality Equation for State Values:

    $$v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]$$

  • (4.2) Bellman Optimality Equation for Action Values:

    $$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a] = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]$$

  • (4.3, 4.4) Policy Evaluation Bellman Equation:

    $$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

  • (4.5) Iterative Policy Evaluation Update:

    $$v_{k+1}(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$$

  • (4.6) Action-Value for Policy Evaluation:

    $$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a] = \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

  • (4.7) Policy Improvement Condition:

    $$q_\pi(s, \pi'(s)) \geq v_\pi(s), \quad \forall s \in S$$

  • (4.9) Greedy Policy Definition:

    $$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

  • (4.10) Value Iteration Update:

    $$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$$
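
The value-iteration update (4.10) and the greedy-policy extraction (4.9) are easy to run on a small tabular MDP. The sketch below assumes a hypothetical two-state, two-action MDP whose dynamics `p[s][a]` (lists of probability, next-state, reward triples) are invented purely for illustration.

```python
# Hypothetical dynamics: p[s][a] = list of (probability, next_state, reward) triples.
p = {
    0: {0: [(1.0, 0, 0.0)],                   # stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},   # usually move to state 1, reward 1
    1: {0: [(1.0, 1, 0.0)],
        1: [(1.0, 0, 2.0)]},                  # return to state 0, reward 2
}

def value_iteration(p, gamma=0.9, theta=1e-8):
    """Sweep v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a)[r + gamma*v_k(s')]  (4.10) to convergence."""
    v = {s: 0.0 for s in p}
    while True:
        delta = 0.0
        for s in p:
            new_v = max(
                sum(prob * (r + gamma * v[s2]) for prob, s2, r in outcomes)
                for outcomes in p[s].values()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def greedy_policy(p, v, gamma=0.9):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a)[r + gamma*v(s')]  (4.9)."""
    return {s: max(p[s], key=lambda a: sum(prob * (r + gamma * v[s2])
                                           for prob, s2, r in p[s][a]))
            for s in p}

if __name__ == "__main__":
    v = value_iteration(p)
    print(v, greedy_policy(p, v))
```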

Monte Carlo Methods (Chapter 5)

  • (5.1) Greedy Policy from Action-Values:

    $$\pi(s) = \arg\max_a q(s, a)$$

  • (5.2) ε-greedy Policy Improvement:

    $$q_\pi(s, \pi'(s)) = \frac{\epsilon}{|A(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \max_a q_\pi(s, a)$$

  • (5.3) Importance Sampling Ratio:

    $$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k | S_k)}{b(A_k | S_k)}$$

  • (5.4) Expected Value under Target Policy:

    $$\mathbb{E}[\rho_{t:T-1} G_t | S_t = s] = v_\pi(s)$$

  • (5.5) Ordinary Importance Sampling Estimator:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$

  • (5.6) Weighted Importance Sampling Estimator:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$

  • (5.7, 5.8) Weighted Average Update Rule:

    $$V_{n+1} = V_n + \frac{W_n}{C_n} [G_n - V_n], \quad n \geq 1, \qquad C_{n+1} = C_n + W_{n+1}, \quad C_0 = 0$$

  • (5.9) Discounting-aware Importance Sampling, Ordinary:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \left[ (1 - \gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \overline{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \overline{G}_{t:T(t)} \right]}{|\mathcal{T}(s)|}$$

    where $\overline{G}_{t:h} = R_{t+1} + R_{t+2} + \dots + R_h$ is the flat (undiscounted) partial return.

  • (5.10) Discounting-aware Importance Sampling, Weighted:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \left[ (1 - \gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \overline{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \overline{G}_{t:T(t)} \right]}{\sum_{t \in \mathcal{T}(s)} \left[ (1 - \gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \right]}$$

  • (5.11) Return Decomposition:

    $$\rho_{t:T-1} G_t = \frac{\pi(A_t | S_t)}{b(A_t | S_t)} \frac{\pi(A_{t+1} | S_{t+1})}{b(A_{t+1} | S_{t+1})} \cdots \frac{\pi(A_{T-1} | S_{T-1})}{b(A_{T-1} | S_{T-1})} \left( R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T \right)$$

  • (5.12-5.15) Per-decision Importance Sampling:

    $$\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$

    $$\mathbb{E}[\rho_{t:T-1} G_t] = \mathbb{E}[\tilde{G}_t]$$

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}$$
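
The two off-policy estimators (5.5) and (5.6) differ only in their denominators, which is easiest to see in code. A small sketch, assuming the per-episode ratios rho_{t:T(t)-1} and returns G_t for visits to a state s have already been computed; the sample numbers are fabricated for illustration.

```python
def ordinary_is(rhos, returns):
    """Ordinary importance sampling (5.5): average rho_t * G_t over the |T(s)| visits."""
    return sum(rho * g for rho, g in zip(rhos, returns)) / len(returns)

def weighted_is(rhos, returns):
    """Weighted importance sampling (5.6): ratio-weighted average of the returns."""
    denom = sum(rhos)
    return sum(rho * g for rho, g in zip(rhos, returns)) / denom if denom else 0.0

if __name__ == "__main__":
    rhos    = [0.5, 2.0, 1.0, 0.0]   # rho_{t:T(t)-1}, one per visit to s
    returns = [1.0, 3.0, 2.0, 5.0]   # corresponding returns G_t
    print(ordinary_is(rhos, returns))   # unbiased, but higher variance
    print(weighted_is(rhos, returns))   # biased (consistent), lower variance
```

Because the weighted estimator is a weighted average of observed returns, it is bounded by the largest return seen, which is one reason the text prefers it in practice despite its bias.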