Reinforcement Learning: An Introduction - Midterm 2 Review

By Tyler Gilman (citing Richard S. Sutton and Andrew G. Barto), Apr 7, 2025

Reinforcement Learning Exam Study Sheet

Date: April 07, 2025
Textbook: Reinforcement Learning: An Introduction (2nd ed.) by Sutton & Barto
Focus: Midterm 1 Questions (Q3, Q4, Q5, Q6) + Deep Learning + DQN + DDPG/TD3


Midterm 1 Questions

Q3: Bandit Problem

  • Setup:
    • Q-values initialized to 0.
    • Uses sample averaging: $Q(a) \leftarrow Q(a) + \frac{1}{n} (r - Q(a))$.
  • Task:
    • Complete a table tracking Q-values after each action and reward.
  • Key Concept:
    • Greedy: Pick action with highest Q-value from previous step.
    • Non-Greedy: Explore other actions (e.g., via $\epsilon$-greedy).
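A minimal sketch of the incremental sample-average update above; the action names, rewards, and trace below are illustrative, not the exam table's values.

```python
# Incremental sample averaging: Q(a) <- Q(a) + (1/n) * (r - Q(a))
Q = {a: 0.0 for a in ("a1", "a2", "a3")}   # Q-values initialized to 0
N = {a: 0 for a in Q}                      # how many times each action was taken

def update(action, reward):
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

# Illustrative trace (not the exam's rewards):
update("a1", 1.0)              # Q(a1) = 1.0
update("a1", 0.0)              # Q(a1) = 0.5
greedy = max(Q, key=Q.get)     # greedy pick: highest Q-value from the previous step
```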

Q4: MDP with Exit States

  • Setup:
    • 3 exit states; must execute “exit” action to leave.
    • Values after 1st time step: +2, +8, +1.
  • Key Points:
    • Optimal value of state A = $+8$.
    • Value of A becomes non-zero after the 3rd time step.
    • Reaches optimal value by 4th time step (depends on $\gamma$, specified in question).
    • Timeline:
      • Time step 0: All values = 0.
      • Time step 1: +2, +8, +1 appear.
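The timeline above is one Bellman backup per time step. Below is a minimal value-iteration sketch for a deterministic MDP with an explicit exit action; the states, rewards, and $\gamma = 1$ are placeholders, not the exam's MDP.

```python
# Value iteration: V(s) <- max_a [ r(s, a) + gamma * V(s') ]
# Exit states pay their reward only through the explicit "exit" action.
gamma = 1.0                                    # placeholder; the exam specifies gamma
transitions = {                                # transitions[s][a] = (next_state, reward); illustrative layout
    "A":     {"right": ("exit8", 0.0)},
    "exit8": {"exit":  ("done", 8.0)},
}
V = {"A": 0.0, "exit8": 0.0, "done": 0.0}      # time step 0: all values are 0

for step in range(4):                          # one backup per time step
    V = {**V, **{s: max(r + gamma * V[s2] for s2, r in acts.values())
                 for s, acts in transitions.items()}}
# exit8 becomes +8 after step 1; A picks it up one backup later.
```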

Q5: Potential Repeat

  • Note: Review Q5 from Exam 1 (details not provided—check notes).
  • Task: Likely a conceptual or computational question; prepare for slight variation.

Q6: Solve for $y$ Based on $x$

  • Setup:
    • A new, more challenging version of the Exam 1 question.
  • Task:
    • Compute $y$ (e.g., NN output or Q-value) given $x$ (input or state).
  • Tip: Practice function application from class examples.

Deep Learning (DL)

Single Unit Neural Network

  • Output: $y = f(w \cdot x + b)$
    • $f$: Activation function.
    • $w$: Weight, $x$: Input, $b$: Bias.
  • ReLU: $\text{ReLU}(z) = \max(0, z)$.
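A quick numerical check of the single-unit formula above; the weights, input, and bias are made-up numbers.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# y = f(w . x + b) with f = ReLU; all numbers are illustrative
w = np.array([0.5, -1.0, 2.0])    # weights
x = np.array([1.0, 2.0, 0.5])     # input
b = 0.1                           # bias
y = relu(np.dot(w, x) + b)        # relu(0.5 - 2.0 + 1.0 + 0.1) = relu(-0.4) = 0.0
```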

Convolutional Layer

  • Mechanics: Filters slide over input to produce feature maps.
  • Input/Output Dimensions:
    • Input: $x$ units; Output: $y$ units.
    • Understand effect of filter size, stride, padding (no formula memorization).
  • Key Diagram:
    • Study convolution picture (expect variation on exam).
    • Focus: Filter application, stride, output shape.
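A tiny 1-D convolution trace to practice filter application, stride, and output shape; the input, filter, and sizes are illustrative.

```python
import numpy as np

def conv1d_valid(x, f, stride=1):
    """Slide filter f over x with no padding; one dot product per position."""
    out_len = (len(x) - len(f)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(f)], f)
                     for i in range(out_len)])

x = np.array([1, 2, 0, 3, 1, 2])   # 6 input units
f = np.array([1, 0, -1])           # filter of size 3
y = conv1d_valid(x, f, stride=1)   # output length (6 - 3)//1 + 1 = 4 -> [1, -1, -1, 1]
```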

AlexNet

  • Memorize:
    • Architecture: Layers, filter sizes, strides, pooling.
    • Details from slide.

One-Layer Evaluation

  • Vector Activation: $h = f(Wx)$
    • $W$: Weight matrix, $x$: Input vector, $f$: Activation (e.g., ReLU).
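A numerical check of the one-layer evaluation above; the weight matrix and input are made-up.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

W = np.array([[1.0, -1.0],
              [0.5, -2.0]])        # weight matrix: 2 units, 2 inputs (illustrative)
x = np.array([2.0, 1.0])           # input vector (illustrative)
h = relu(W @ x)                    # W x = [1.0, -1.0]; ReLU -> [1.0, 0.0]
```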

Deep Q-Networks (DQN)

State Space & Q-Network

  • Format:
    • State: e.g., raw pixels or feature vectors.
    • Actions: Discrete set.
  • Q-Network:
    • Input: State.
    • Hidden: Fully connected or convolutional layers.
    • Output: Q-values for each action.

Forward Passes & Training

  • Forward Passes:
    • 1 through the online network per action selection + 1 through the target network per TD-target computation.
  • Training:
    • Once per step, using mini-batch from replay buffer.

DQN Problem Outline

  • Define: State space, action space, reward function.
  • NN Structure: e.g., Conv layers → Dense layers → Q-values.
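A sketch of a Conv → Dense → Q-values network; the input shape (4 stacked 84×84 frames) and layer sizes are assumptions for illustration, not the course's exact architecture.

```python
import torch.nn as nn

# Conv layers -> dense layers -> one Q-value per discrete action.
# Input shape and layer sizes are illustrative.
class QNetwork(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, n_actions),        # one Q-value per discrete action
        )

    def forward(self, state):                 # state: (batch, 4, 84, 84)
        return self.head(self.features(state))
```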

Loss Function & TD Error

  • L2 Loss:
    $$ L = \frac{1}{2} \left( Q(s, a; \theta) - y \right)^2 $$
    • $y$: TD target.
  • TD Target: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$
    • $\theta^-$: Target network parameters.
  • TD Error: $\delta = y - Q(s, a; \theta) = r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)$
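A sketch of the L2 loss, TD target, and TD error on a replay mini-batch; the tensor names (`s`, `a`, `r`, `s_prime`, `done`), the networks, and `gamma` are assumptions for illustration.

```python
import torch

# One DQN training step on a replay mini-batch.
def dqn_loss(q_net, target_net, s, a, r, s_prime, done, gamma):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
    with torch.no_grad():
        max_q_next = target_net(s_prime).max(dim=1).values      # max_a' Q(s', a'; theta^-)
        y = r + gamma * (1.0 - done) * max_q_next               # TD target y
    td_error = y - q_sa                                         # delta = y - Q(s, a; theta)
    return 0.5 * (td_error ** 2).mean()                         # L2 loss averaged over the batch
```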

Hyperparameters

  • Design Choices:
    • Target network update frequency (e.g., every $C$ steps).
    • Training frequency (e.g., every step).
    • Replay buffer size.
    • Mini-batch size.
    • Learning rate, $\gamma$ (discount), $\epsilon$ (exploration).

Double DQN

  • TD Target: $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big)$
  • Purpose: Reduces overestimation.
  • Why: Online network selects action; target network evaluates.
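A sketch of the Double DQN target, with the online network choosing $a'$ and the target network scoring it; network and tensor names are assumptions, matching the DQN loss sketch above.

```python
import torch

# Double DQN target: the online network picks a', the target network evaluates it.
def double_dqn_target(q_net, target_net, r, s_prime, done, gamma):
    with torch.no_grad():
        a_star = q_net(s_prime).argmax(dim=1, keepdim=True)        # argmax_a' Q(s', a'; theta)
        q_eval = target_net(s_prime).gather(1, a_star).squeeze(1)  # Q(s', a*; theta^-)
        return r + gamma * (1.0 - done) * q_eval                   # y
```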

Dueling Networks

  • Structure:
    $$ Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right) $$

  • Memorize: Slide details (architecture, intuition).
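A sketch of a dueling head combining $V(s)$ and the mean-subtracted advantage; the feature dimension and single-linear streams are illustrative.

```python
import torch.nn as nn

# Dueling head: separate V(s) and A(s, a) streams, recombined per the formula above.
class DuelingHead(nn.Module):
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)              # V(s)
        self.advantage = nn.Linear(feat_dim, n_actions)  # A(s, a)

    def forward(self, features):
        v = self.value(features)                         # (batch, 1)
        a = self.advantage(features)                     # (batch, n_actions)
        return v + (a - a.mean(dim=1, keepdim=True))     # Q(s, a) = V + (A - mean_a' A)
```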

DDPG & TD3

Shared Concepts

  • Gradient:
    • Actor: Ascent on $\nabla_{\phi} J \approx \nabla_a Q(s, a; \theta)\big|_{a = \mu(s; \phi)} \cdot \nabla_{\phi}\, \mu(s; \phi)$.
    • Critic: Descent on Q-function parameters.
  • L2 Loss:
    • Supervised:
      $$ L = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2 $$

    • DDPG:
      $$ L = \frac{1}{2} (Q(s, a; \theta) - (r + \gamma Q(s', \mu(s'; \phi^-); \theta^-)))^2 $$

  • TD Error:
    • DDPG: $\delta = r + \gamma\, Q(s', \mu(s'; \phi^-); \theta^-) - Q(s, a; \theta)$.
    • DQN: $\delta = r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)$.
  • Exploration: DDPG uses noise (e.g., OU noise).
  • Experience Replay: Buffer stores $(s, a, r, s')$.

DDPG

  • Actor-Critic:
    • Actor: $\mu(s; \phi)$ (deterministic policy).
    • Critic: $Q(s, a; \theta)$.
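A sketch of one DDPG update, covering the critic's L2 loss on the TD target and the actor's gradient ascent on $Q(s, \mu(s))$; the network, optimizer, and batch tensor names are assumptions.

```python
import torch

# One DDPG update: critic descends on the L2 loss against the TD target,
# actor ascends on Q(s, mu(s)). All names are illustrative.
def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, s, a, r, s_prime, done, gamma):
    # Critic: L = 1/2 * (Q(s, a; theta) - (r + gamma * Q(s', mu(s'; phi^-); theta^-)))^2
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_target(s_prime, actor_target(s_prime)).squeeze(1)
    critic_loss = 0.5 * (critic(s, a).squeeze(1) - y).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on Q(s, mu(s; phi)) via descent on its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```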

TD3

  • Distinctions:
    • Twin Q-networks ($Q_1$, $Q_2$); use $\min(Q_1, Q_2)$.
    • Delayed policy/target updates.
    • Target policy smoothing (noise on target actions).
  • Target Updates:
    • DQN: Hard/soft updates.
    • TD3: Soft updates ($\tau \approx 0.005$), delayed policy updates.
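A sketch of TD3's target computation (target policy smoothing plus the twin-critic minimum) and the soft target update with $\tau \approx 0.005$; the noise scales, action limit, and all names are assumptions.

```python
import torch

# TD3 target: clipped noise on the target action, then min of the twin target critics.
def td3_target(q1_target, q2_target, actor_target, r, s_prime, done, gamma,
               noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    with torch.no_grad():
        a_prime = actor_target(s_prime)
        noise = (torch.randn_like(a_prime) * noise_std).clamp(-noise_clip, noise_clip)
        a_prime = (a_prime + noise).clamp(-act_limit, act_limit)   # target policy smoothing
        q_min = torch.min(q1_target(s_prime, a_prime), q2_target(s_prime, a_prime)).squeeze(1)
        return r + gamma * (1.0 - done) * q_min

def soft_update(target_net, online_net, tau=0.005):
    # theta^- <- tau * theta + (1 - tau) * theta^-
    for p_targ, p in zip(target_net.parameters(), online_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```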

Quick Review Checklist

  • Memorize:
    • AlexNet architecture.
    • L2 loss, TD error, and TD target formulas.
    • Dueling Networks slide.
  • Practice:
    • Single-unit NN output.
    • Convolution tracing.
    • DQN/DDPG algorithm outlines.
  • Understand:
    • Double DQN’s $\max$ logic.
    • TD3’s twin Q-functions.
    • Gradient ascent intuition.