Reinforcement Learning: An Introduction - Formulas
Multi-armed Bandits (Chapter 2)
- Q-value (True action value):
$$q_*(a) = E[R_t | A_t = a]$$
(2.2) Estimated Q-value (Sample average):
$$Q_t(a) = \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}}$$
(2.3) Formal estimated Q-value:
$$Q_t(a) = \frac{\sum^{t-1}_{i=1} R_i \cdot I_{A_i = a}}{\sum^{t-1}_{i=1} I_{A_i = a}}$$
(2.4) Incremental Q-value update:
$$Q_{n+1} = Q_n + \frac{1}{n} [R_n - Q_n]$$
(2.5) General update rule:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} [\text{Target} - \text{OldEstimate}]$$
(2.6) Constant step-size update:
$$Q_{n+1} = Q_n + \alpha [R_n - Q_n]$$
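To make (2.4)–(2.6) concrete, here is a minimal Python sketch of an ε-greedy bandit agent that maintains its estimates with the incremental update; the class name `IncrementalBandit`, the `epsilon`/`alpha` parameters, and the action interface are illustrative assumptions, not from the book.

```python
import random

class IncrementalBandit:
    """Action-value estimates maintained with the incremental update (2.4)/(2.6)."""

    def __init__(self, n_actions, epsilon=0.1, alpha=None):
        self.epsilon = epsilon      # exploration probability (assumed hyperparameter)
        self.alpha = alpha          # None -> sample average (2.4); float -> constant step size (2.6)
        self.Q = [0.0] * n_actions  # estimated action values Q_t(a)
        self.N = [0] * n_actions    # selection counts N_t(a)

    def select_action(self):
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily on Q
        if random.random() < self.epsilon:
            return random.randrange(len(self.Q))
        return max(range(len(self.Q)), key=lambda a: self.Q[a])

    def update(self, action, reward):
        # NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate]   (2.5)
        self.N[action] += 1
        step = self.alpha if self.alpha is not None else 1.0 / self.N[action]
        self.Q[action] += step * (reward - self.Q[action])
```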
(2.7) Expanded constant step-size update:
$$Q_{n+1} = (1 - \alpha)^n Q_1 + \sum^{n}_{i=1} \alpha (1 - \alpha)^{n-i} R_i$$
(2.8) Convergence conditions:
$$\sum^\infty_{n=1} \alpha_n(a) = \infty \quad \text{and} \quad \sum^\infty_{n=1} \alpha^2_n(a) < \infty$$
(2.9) UCB action selection:
$$A_t = \text{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
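As an illustration of (2.9), a short sketch of UCB action selection, assuming each action is tried once before the bound is applied; the function name and the `c=2.0` default are illustrative choices.

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]   (2.9).

    Q: list of value estimates, N: list of selection counts,
    t: current time step (1-based), c: exploration coefficient.
    """
    # An untried action is treated as maximizing (its bound would be infinite).
    for a, n in enumerate(N):
        if n == 0:
            return a
    bounds = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return max(range(len(Q)), key=lambda a: bounds[a])
```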
(2.10) Softmax action selection probability:
$$Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum^{n}_{b=1} e^{H_t(b)}} = \pi_t(a)$$
(2.11) Gradient bandit update (selected action):
$$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \overline{R}_t)(1 - \pi_t(A_t))$$
(2.12) Gradient bandit update (non-selected actions):
$$H_{t+1}(a) = H_t(a) - \alpha (R_t - \overline{R}_t) \pi_t(a), \quad \forall a \neq A_t$$
(2.13) Gradient ascent update:
$$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)}$$
(2.14) Expected reward:
$$E[R_t] = \sum_x \pi_t(x) q_*(x)$$
(2.15) Partial derivative for gradient bandit:
$$\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b) (I_{a=b} - \pi_t(a))$$
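A minimal sketch of one gradient-bandit preference update, combining the softmax policy (2.10) with the two update cases (2.11)/(2.12); the function name and the list-based representation of the preferences `H` are assumptions for illustration.

```python
import math

def gradient_bandit_step(H, avg_reward, reward, action, alpha=0.1):
    """One update of the action preferences H per (2.11)/(2.12).

    H: list of preferences H_t(a); avg_reward: baseline (average reward so far);
    reward: R_t just received; action: index of the selected action A_t.
    Returns the updated preference list.
    """
    # Softmax policy pi_t(a) over the preferences   (2.10)
    exps = [math.exp(h) for h in H]
    total = sum(exps)
    pi = [e / total for e in exps]

    new_H = []
    for a, h in enumerate(H):
        if a == action:
            new_H.append(h + alpha * (reward - avg_reward) * (1 - pi[a]))  # (2.11)
        else:
            new_H.append(h - alpha * (reward - avg_reward) * pi[a])        # (2.12)
    return new_H
```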
Markov Decision Processes (Chapter 3)
(3.2) MDP Dynamics:
$$p(s', r | s, a) = Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a\}$$
(3.4) State Transition Probabilities:
$$p(s' | s, a) = \sum_{r \in R} p(s', r | s, a)$$
(3.5) Expected Rewards for State-Action Pairs:
$$r(s, a) = \sum_{r \in R} \sum_{s' \in S} r \cdot p(s', r | s, a)$$
(3.6) Expected Rewards for State-Action-Next-State Triples:
$$r(s, a, s') = \sum_{r \in R} r \cdot \frac{p(s', r | s, a)}{p(s' | s, a)}$$
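A sketch showing how (3.4) and (3.5) fall out of a tabular dynamics function; representing p(s', r | s, a) as a nested dict keyed by (s, a) and (s', r) is an assumed encoding, not something the book prescribes.

```python
from collections import defaultdict

def transition_prob(dynamics, s, a):
    """p(s'|s,a) = sum_r p(s',r|s,a)   (3.4).

    dynamics: dict mapping (s, a) -> dict mapping (s_next, r) -> probability
              (assumed tabular encoding of p(s', r | s, a)).
    """
    p = defaultdict(float)
    for (s_next, r), prob in dynamics[(s, a)].items():
        p[s_next] += prob
    return dict(p)

def expected_reward(dynamics, s, a):
    """r(s,a) = sum_{s',r} r * p(s',r|s,a)   (3.5)."""
    return sum(r * prob for (s_next, r), prob in dynamics[(s, a)].items())
```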
(3.7) Episodic Return:
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$
(3.8) Discounted Return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
(3.9) Recursive Return Relationship:
$$G_t = R_{t+1} + \gamma G_{t+1}$$
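A sketch computing every G_t of a finished episode by running the recursion (3.9) backwards from the terminal step; the function name and the 0-indexed reward list are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for each step using G_t = R_{t+1} + gamma * G_{t+1}   (3.9).

    rewards: list [R_1, R_2, ..., R_T] collected over one episode.
    Returns [G_0, G_1, ..., G_{T-1}].
    """
    G = 0.0
    returns = []
    for r in reversed(rewards):   # work backwards from the end of the episode
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns
```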
(3.12) State-Value Function:
$$v_\pi(s) = E_\pi[G_t | S_t = s] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s \right]$$
(3.13) Action-Value Function:
$$q_\pi(s, a) = E_\pi[G_t | S_t = s, A_t = a] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} | S_t = s, A_t = a \right]$$
(3.14) Bellman Equation for State Values:
$$v_\pi(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$
(3.15) Optimal State-Value Function:
$$v_*(s) = \max_\pi v_\pi(s)$$
(3.16) Optimal Action-Value Function:
$$q_*(s, a) = \max_\pi q_\pi(s, a)$$
(3.17) Relationship Between $v_*$ and $q_*$:
$$q_*(s, a) = E[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]$$
(3.19) Bellman Optimality Equation for State Values:
$$v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]$$
(3.20) Bellman Optimality Equation for Action Values:
$$q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]$$
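A sketch of the single-state backups on the right-hand sides of (3.14) and (3.19), reusing the nested-dict encoding of p(s', r | s, a) assumed above; function names and argument layout are illustrative.

```python
def bellman_expectation_backup(policy_s, dynamics_s, v, gamma=0.99):
    """Right-hand side of (3.14) for one state s.

    policy_s:   dict a -> pi(a|s);
    dynamics_s: dict a -> dict (s_next, r) -> p(s', r | s, a)  (assumed encoding);
    v:          dict s_next -> current value estimate.
    """
    return sum(
        pi_a * sum(prob * (r + gamma * v[s_next])
                   for (s_next, r), prob in dynamics_s[a].items())
        for a, pi_a in policy_s.items()
    )

def bellman_optimality_backup(dynamics_s, v, gamma=0.99):
    """Right-hand side of (3.19): a max over actions instead of an average under pi."""
    return max(
        sum(prob * (r + gamma * v[s_next]) for (s_next, r), prob in dyn_a.items())
        for dyn_a in dynamics_s.values()
    )
```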
Dynamic Programming (Chapter 4)
(4.1) Bellman Optimality Equation for State Values:
$$v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]$$
(4.2) Bellman Optimality Equation for Action Values:
$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a] = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')]$$
(4.3, 4.4) Policy Evaluation Bellman Equation:
$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$
(4.5) Iterative Policy Evaluation Update:
$$v_{k+1}(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$$
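A sketch of iterative policy evaluation built directly on the sweep update (4.5); the convergence threshold `theta` and the tabular dict representations of the policy and dynamics are assumptions.

```python
def policy_evaluation(states, policy, dynamics, gamma=0.99, theta=1e-6):
    """Iterative policy evaluation with the sweep update (4.5).

    states:   iterable of states;
    policy:   dict s -> dict a -> pi(a|s);
    dynamics: dict (s, a) -> dict (s_next, r) -> p(s', r | s, a)  (assumed encoding);
    theta:    stop once the largest value change in a sweep falls below this.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi_a * sum(prob * (r + gamma * V[s_next])
                           for (s_next, r), prob in dynamics[(s, a)].items())
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new   # in-place update within the sweep
        if delta < theta:
            return V
```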
(4.6) Action-Value for Policy Evaluation:
$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a] = \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$
(4.7) Policy Improvement Condition:
$$q_\pi(s, \pi'(s)) \geq v_\pi(s), \quad \forall s \in S$$
(4.9) Greedy Policy Definition:
$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$
(4.10) Value Iteration Update:
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$$
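A sketch of value iteration using the max-backup (4.10), followed by greedy policy extraction as in (4.9); the data-structure choices mirror the policy-evaluation sketch above and are assumptions.

```python
def value_iteration(states, actions, dynamics, gamma=0.99, theta=1e-6):
    """Value iteration (4.10), then the greedy policy (4.9).

    actions:  dict s -> list of actions available in s;
    dynamics: dict (s, a) -> dict (s_next, r) -> p(s', r | s, a)  (assumed encoding).
    """
    def backup(s, a, V):
        return sum(prob * (r + gamma * V[s_next])
                   for (s_next, r), prob in dynamics[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(backup(s, a, V) for a in actions[s])   # (4.10)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    policy = {s: max(actions[s], key=lambda a: backup(s, a, V)) for s in states}  # (4.9)
    return V, policy
```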
Monte Carlo Methods (Chapter 5)
(5.1) Greedy Policy from Action-Values:
$$\pi(s) = \arg\max_a q(s, a)$$
(5.2) Policy Improvement Equation:
$$q_\pi(s, \pi'(s)) = \frac{\epsilon}{|A(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \max_a q_\pi(s, a)$$
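A small sketch of the ε-soft action probabilities that the mixture in (5.2) averages over; the function name and list return format are illustrative.

```python
def epsilon_greedy_probs(q_s, epsilon=0.1):
    """Action probabilities of an epsilon-greedy policy built from q(s, .).

    Every action gets epsilon/|A(s)|, and the greedy action gets the remaining
    1 - epsilon on top, which is exactly the mixture appearing in (5.2).
    """
    n = len(q_s)
    greedy = max(range(n), key=lambda a: q_s[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1 - epsilon
    return probs
```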
(5.3) Importance Sampling Ratio:
$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k | S_k)}{b(A_k | S_k)}$$
(5.4) Expected Value under Target Policy:
$$E[\rho_{t:T-1} G_t | S_t = s] = v_\pi(s)$$
(5.5) Ordinary Importance Sampling Estimator:
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
(5.6) Weighted Importance Sampling Estimator:
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
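A sketch contrasting the ordinary (5.5) and weighted (5.6) estimators, given the ratio and return observed at each visit to s; the flat-list inputs are an assumed bookkeeping format.

```python
def is_estimates(rhos, returns):
    """Ordinary (5.5) and weighted (5.6) importance-sampling estimates of v_pi(s).

    rhos:    ratios rho_{t:T(t)-1}, one per visit to s;
    returns: the corresponding returns G_t observed under the behavior policy b.
    """
    weighted_returns = [rho * g for rho, g in zip(rhos, returns)]
    ordinary = sum(weighted_returns) / len(returns)                           # divide by |T(s)|
    weighted = sum(weighted_returns) / sum(rhos) if sum(rhos) > 0 else 0.0    # divide by sum of ratios
    return ordinary, weighted
```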
(5.7, 5.8) Weighted Average Update Rule:
$$V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \qquad V_{n+1} = V_n + \frac{W_n}{C_n} [G_n - V_n], \quad C_{n+1} = C_n + W_{n+1}$$
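A sketch of the incremental form (5.7)/(5.8) of the weighted average as a small accumulator; the class name `WeightedAverage` is illustrative.

```python
class WeightedAverage:
    """Incremental weighted importance-sampling average per (5.7)/(5.8)."""

    def __init__(self):
        self.V = 0.0   # current estimate V_n
        self.C = 0.0   # cumulative sum of weights C_n

    def update(self, G, W):
        """Fold in one return G_n with weight W_n (e.g. W_n = rho_{t:T(t)-1})."""
        self.C += W                                # C_n = C_{n-1} + W_n
        if self.C > 0:
            self.V += (W / self.C) * (G - self.V)  # V_{n+1} = V_n + (W_n / C_n) [G_n - V_n]
        return self.V
```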
(5.9) Discounting-aware Importance Sampling, Ordinary (with flat partial returns $\bar{G}_{t:h} = R_{t+1} + R_{t+2} + \dots + R_h$):
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \right)}{|\mathcal{T}(s)|}$$
(5.10) Discounting-aware Importance Sampling, Weighted:
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \right)}{\sum_{t \in \mathcal{T}(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \right)}$$
(5.11) Return Decomposition:
$$\rho_{t:T-1} G_t = \rho_{t:T-1} R_{t+1} + \gamma \rho_{t:T-1} R_{t+2} + \gamma^2 \rho_{t:T-1} R_{t+3} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$
(5.12-5.15) Per-decision Importance Sampling:
$$E[\rho_{t:T-1} R_{t+1}] = E[\rho_{t:t} R_{t+1}], \qquad E[\rho_{t:T-1} G_t] = E[\tilde{G}_t]$$
$$\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}$$
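A sketch computing the per-decision importance-sampling return G̃_t of (5.12)–(5.15) for a single episode (here for t = 0); taking a list of per-step ratios as input is an assumed format.

```python
def per_decision_return(rewards, rhos, gamma=0.99):
    """Per-decision importance-sampling return G~_0 per (5.15).

    rewards: [R_1, ..., R_T] observed under the behavior policy b;
    rhos:    per-step ratios pi(A_k|S_k) / b(A_k|S_k) for k = 0, ..., T-1.
    Each reward R_{k+1} is weighted only by the ratios up to the action that produced it.
    """
    g = 0.0
    rho_prefix = 1.0
    for k, (r, rho) in enumerate(zip(rewards, rhos)):
        rho_prefix *= rho                     # rho_{0:k}
        g += (gamma ** k) * rho_prefix * r    # gamma^k * rho_{0:k} * R_{k+1}
    return g
```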