Finite Markov Decision Processes (RL Intro Chapter 3)

By Tyler Gilman, citing Richard S. Sutton and Andrew G. Barto, on Apr 7, 2025

Summary of Chapter 3: Finite Markov Decision Processes

This chapter introduces Finite Markov Decision Processes (MDPs), which provide the formal framework for reinforcement learning problems. MDPs extend the bandit problems from the previous chapter by adding states and sequential decision-making, where actions affect not just immediate rewards but also future states and rewards.

Key Concepts

Agent-Environment Interface

  • The agent interacts with the environment by selecting actions, receiving rewards, and transitioning to new states
  • At each time step t, the agent:
    • Observes state $S_t$
    • Selects action $A_t$
    • Receives reward $R_{t+1}$
    • Transitions to state $S_{t+1}$
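
A minimal sketch of this interaction loop, assuming a hypothetical `env` object with `reset()` and `step(action)` methods and a `policy` function that maps states to actions; these names are illustrative, not from the book:

```python
def run_episode(env, policy, max_steps=1000):
    """Roll out one episode of the agent-environment loop.

    At each step t: observe S_t, select A_t = policy(S_t),
    then receive R_{t+1} and S_{t+1} from the environment.
    """
    state = env.reset()                              # S_0
    trajectory = []
    for t in range(max_steps):
        action = policy(state)                       # A_t
        next_state, reward, done = env.step(action)  # R_{t+1}, S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                     # terminal state reached
            break
    return trajectory
```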

MDP Dynamics

  • The dynamics of an MDP are defined by the probability function $p(s',r|s,a)$, which gives the probability of transitioning to state $s'$ with reward $r$ when taking action $a$ in state $s$
  • From this four-argument function, we can derive:
    • State transition probabilities $p(s'|s,a)$
    • Expected rewards $r(s,a)$ and $r(s,a,s')$
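
For concreteness, here is a minimal sketch of deriving $p(s'|s,a)$ and $r(s,a)$ from the four-argument dynamics. The two-state dynamics table is hypothetical (loosely in the spirit of the chapter's recycling-robot example, with made-up numbers), and the function names are my own:

```python
from collections import defaultdict

# Hypothetical dynamics p(s', r | s, a): (s, a) -> list of (s', r, probability)
p = {
    ("high", "search"):  [("high", 1.0, 0.7), ("low", 1.0, 0.3)],
    ("high", "wait"):    [("high", 0.5, 1.0)],
    ("low", "search"):   [("high", -3.0, 0.4), ("low", 1.0, 0.6)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}

def transition_prob(s, a):
    """p(s'|s,a) = sum_r p(s',r|s,a)  (equation 3.4)."""
    probs = defaultdict(float)
    for s_next, r, prob in p[(s, a)]:
        probs[s_next] += prob
    return dict(probs)

def expected_reward(s, a):
    """r(s,a) = sum_{s',r} r * p(s',r|s,a)  (equation 3.5)."""
    return sum(r * prob for _, r, prob in p[(s, a)])

print(transition_prob("low", "search"))  # {'high': 0.4, 'low': 0.6}
print(expected_reward("low", "search"))  # -3.0*0.4 + 1.0*0.6 = -0.6
```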

Returns and Episodes

  • The return $G_t$ is the cumulative reward the agent aims to maximize
  • For episodic tasks (with terminal states), the return is the sum of rewards until termination
  • For continuing tasks (without natural endpoints), the discounted return is used with discount factor $\gamma$ ($0 \le \gamma < 1$)
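
For concreteness, a short sketch of computing a return from a finite reward sequence (the sequence and function name are illustrative, not from the book); `rewards[k]` is taken to hold $R_{t+k+1}$:

```python
def discounted_return(rewards, gamma=1.0):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

    With gamma = 1.0 this is the undiscounted episodic return (3.7);
    with gamma < 1 it is the discounted return (3.8).
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical reward sequence for illustration
print(discounted_return([1.0, 1.0, 1.0, 10.0]))             # 13.0 (episodic, undiscounted)
print(discounted_return([1.0, 1.0, 1.0, 10.0], gamma=0.9))  # 1 + 0.9 + 0.81 + 7.29 ≈ 10.0
```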

Policies and Value Functions

  • A policy $\pi(a|s)$ is a mapping from states to probabilities of selecting each available action
  • State-value function $v_\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$
  • Action-value function $q_\pi(s,a)$ is the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$
  • Both value functions satisfy recursive Bellman equations
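
The Bellman equation for $v_\pi$ can be applied repeatedly as an update rule until the values stop changing (the book develops this fully as iterative policy evaluation in its dynamic programming chapter). A minimal sketch of that idea, reusing the hypothetical two-state dynamics format from the earlier sketch; all names and numbers are illustrative:

```python
def policy_evaluation(p, states, policy, gamma=0.9, theta=1e-8):
    """Sweep the Bellman equation (3.14) as an update until convergence:
       v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                prob_a * sum(prob * (r + gamma * v[s_next])
                             for s_next, r, prob in p[(s, a)])
                for a, prob_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:   # values have (numerically) stopped changing
            return v

# Hypothetical two-state dynamics: (s, a) -> [(s', r, probability), ...]
p = {
    ("high", "search"):  [("high", 1.0, 0.7), ("low", 1.0, 0.3)],
    ("high", "wait"):    [("high", 0.5, 1.0)],
    ("low", "search"):   [("high", -3.0, 0.4), ("low", 1.0, 0.6)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}
policy = {  # pi(a|s): equiprobable over the actions available in each state
    "high": {"search": 0.5, "wait": 0.5},
    "low":  {"search": 0.5, "recharge": 0.5},
}
print(policy_evaluation(p, ["high", "low"], policy))
```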

Optimal Policies and Value Functions

  • Optimal policy $\pi_*$ maximizes the expected return from all states
  • Optimal state-value function $v_*(s)$ is the maximum value achievable from state $s$
  • Optimal action-value function $q_*(s,a)$ is the maximum value achievable by taking action $a$ in state $s$ and following an optimal policy afterward
  • These optimal value functions satisfy the Bellman optimality equations
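
Similarly, the Bellman optimality equation can be turned into an update rule, which is the idea behind value iteration (also developed in the book's dynamic programming chapter). A sketch under the same hypothetical dynamics, recovering both $v_*$ and a greedy optimal policy:

```python
def value_iteration(p, states, actions, gamma=0.9, theta=1e-8):
    """Sweep the Bellman optimality equation (3.19) as an update:
       v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in p[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # A policy greedy with respect to v* is an optimal policy pi_*
    pi = {
        s: max(actions[s],
               key=lambda a: sum(prob * (r + gamma * v[s_next])
                                 for s_next, r, prob in p[(s, a)]))
        for s in states
    }
    return v, pi

p = {  # same hypothetical two-state dynamics as in the earlier sketches
    ("high", "search"):  [("high", 1.0, 0.7), ("low", 1.0, 0.3)],
    ("high", "wait"):    [("high", 0.5, 1.0)],
    ("low", "search"):   [("high", -3.0, 0.4), ("low", 1.0, 0.6)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}
actions = {"high": ["search", "wait"], "low": ["search", "recharge"]}
v_star, pi_star = value_iteration(p, ["high", "low"], actions)
print(v_star, pi_star)
```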

Important Equations

  1. MDP Dynamics (3.2): $p(s',r|s,a) \doteq \Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a\}$

  2. State Transition Probabilities (3.4): $p(s'|s,a) \doteq \Pr\{S_{t+1}=s' \mid S_t=s, A_t=a\} = \sum_{r \in \mathcal{R}} p(s',r|s,a)$

  3. Expected Rewards for State-Action Pairs (3.5): $r(s,a) \doteq \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s',r|s,a)$

  4. Expected Rewards for State-Action-Next-State Triples (3.6): $r(s,a,s') \doteq \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s'] = \sum_{r \in \mathcal{R}} r \, \frac{p(s',r|s,a)}{p(s'|s,a)}$

  5. Episodic Return (3.7): $G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$

  6. Discounted Return (3.8): $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

  7. Recursive Return Relationship (3.9): $G_t = R_{t+1} + \gamma G_{t+1}$

  8. State-Value Function (3.12): $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t=s]$

  9. Action-Value Function (3.13): $q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a]$

  10. Bellman Equation for State Values (3.14): $v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_\pi(s')\right]$

  11. Optimal State-Value Function (3.15): $v_*(s) = \max_\pi v_\pi(s)$

  12. Optimal Action-Value Function (3.16): $q_*(s,a) = \max_\pi q_\pi(s,a)$

  13. Relationship Between $v_*$ and $q_*$ (3.17): $q_*(s,a) = \mathbb{E}\left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t=s, A_t=a\right]$

  14. Bellman Optimality Equation for State Values (3.19): $v_*(s) = \max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_*(s')\right]$

  15. Bellman Optimality Equation for Action Values (3.20): $q_*(s,a) = \sum_{s',r} p(s',r|s,a)\left[r + \gamma \max_{a'} q_*(s',a')\right]$
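
As a quick numerical sanity check on equations (3.8) and (3.9): computing the return backwards with the recursion reproduces the discounted sum, and for a constant reward of +1 it approaches the closed form $1/(1-\gamma)$. The reward stream below is made up for illustration:

```python
gamma = 0.9
rewards = [1.0] * 200           # long truncated stream of constant +1 rewards

# Compute G_0 backwards with the recursion (3.9): G_t = R_{t+1} + gamma * G_{t+1}
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g

print(g)                        # ~10.0
print(1 / (1 - gamma))          # 10.0, the closed-form limit for constant +1 reward
```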

Noteworthy Problems

  1. Exercise 3.11: Expressing the expected reward $\mathbb{E}[R_{t+1}]$ in terms of the policy $\pi$ and the dynamics $p$
  2. Exercise 3.12: Equation for $v_\pi$ in terms of $q_\pi$ and $\pi$
  3. Exercise 3.13: Equation for $q_\pi$ in terms of $v_\pi$ and $p$
  4. Exercise 3.17: Deriving the Bellman equation for action values
  5. Exercise 3.23: Bellman equation for $q_*$ for the recycling robot example
  6. Exercises 3.25-3.28: Relationships between $v_*$, $q_*$, and $\pi_*$
  7. Exercise 3.29: Rewriting Bellman equations using alternative forms of $p$ and $r$

The chapter emphasizes that while MDPs provide an ideal mathematical framework, real-world applications often require approximations due to computational limitations, memory constraints, and incomplete knowledge of the environment.