Reinforcement Learning: An Introduction - Midterm 2 Review
Reinforcement Learning Exam Study Sheet
Date: April 07, 2025
Textbook: Reinforcement Learning: An Introduction (2nd ed.) by Sutton & Barto
Focus: Midterm 1 Questions (Q3, Q4, Q5, Q6) + Deep Learning + DQN + DDPG/TD3
Midterm 1 Questions
Q3: Bandit Problem
- Setup:
- Q-values initialized to 0.
- Uses sample averaging: $Q(a) \leftarrow Q(a) + \frac{1}{n} (r - Q(a))$.
- Task:
- Complete a table tracking Q-values after each action and reward.
- Key Concept:
- Greedy: Pick action with highest Q-value from previous step.
- Non-Greedy: Explore other actions (e.g., via $\epsilon$-greedy).
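A minimal sketch of the sample-average update and greedy pick above (the actions and rewards in the trace are made up for illustration):

```python
import numpy as np

# 3-armed bandit, Q-values initialized to 0, incremental sample averaging.
Q = np.zeros(3)
N = np.zeros(3, dtype=int)  # per-action counts

def update(a, r):
    """Q(a) <- Q(a) + (1/n) * (r - Q(a))"""
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

# Made-up trace of (action, reward) pairs, as in the exam table.
for a, r in [(0, 1.0), (1, 2.0), (1, 0.0), (2, 3.0)]:
    update(a, r)
    print(f"a={a}, r={r}, Q={Q}")

greedy = int(np.argmax(Q))  # greedy action = argmax of current Q-values
```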
Q4: MDP with Exit States
- Setup:
- 3 exit states; must execute “exit” action to leave.
- Values after 1st time step: +2, +8, +1.
- Key Points:
- Optimal value of state A = $+8$.
- Value of state A becomes non-zero after the 3rd time step.
- Reaches optimal value by 4th time step (depends on $\gamma$, specified in question).
- Timeline:
- Time step 0: All values = 0.
- Time step 1: +2, +8, +1 appear.
Q5: Potential Repeat
- Note: Review Q5 from Exam 1 (details not provided—check notes).
- Task: Likely a conceptual or computational question; prepare for slight variation.
Q6: Solve for $y$ Based on $x$
- Setup:
- New, more challenging version than Exam 1.
- Task:
- Compute $y$ (e.g., NN output or Q-value) given $x$ (input or state).
- Tip: Practice function application from class examples.
Deep Learning (DL)
Single Unit Neural Network
- Output: $y = f(w \cdot x + b)$
- $f$: Activation function.
- $w$: Weight, $x$: Input, $b$: Bias.
- ReLU: $\text{ReLU}(z) = \max(0, z)$.
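A tiny numeric check of the single-unit formula (the weights, input, and bias are arbitrary numbers):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

w = np.array([0.5, -1.0, 2.0])   # weights (arbitrary)
x = np.array([1.0, 2.0, 0.5])    # input (arbitrary)
b = 0.1                          # bias

y = relu(np.dot(w, x) + b)       # y = f(w . x + b) with f = ReLU
print(y)                         # max(0, -0.4) = 0.0
```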
Convolutional Layer
- Mechanics: Filters slide over input to produce feature maps.
- Input/Output Dimensions:
- Input: $x$ units; Output: $y$ units.
- Understand effect of filter size, stride, padding (no formula memorization).
- Key Diagram:
- Study convolution picture (expect variation on exam).
- Focus: Filter application, stride, output shape.
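A small sketch for tracing filter application and output shape in 1-D (the input, filter, and stride are arbitrary; "valid" means no padding):

```python
import numpy as np

x = np.array([1, 2, 0, -1, 3, 1], dtype=float)   # input of length 6
f = np.array([1, 0, -1], dtype=float)            # filter of size 3
stride = 1

out_len = (len(x) - len(f)) // stride + 1        # valid-convolution output length
out = np.array([np.dot(x[i * stride : i * stride + len(f)], f) for i in range(out_len)])
print(out_len, out)                              # 4 [ 1.  3. -3. -2.]
```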
AlexNet
- Memorize:
- Architecture: Layers, filter sizes, strides, pooling.
- Details from slide.
One-Layer Evaluation
- Vector Activation: $h = f(Wx + b)$
- $W$: Weight matrix, $x$: Input vector, $b$: Bias vector, $f$: Activation (e.g., ReLU).
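A minimal sketch of one-layer evaluation with a ReLU activation (the matrix, input, and bias values are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

W = np.array([[1.0, -1.0],
              [0.5,  2.0],
              [0.0,  1.0]])      # 3 units, 2 inputs
x = np.array([2.0, 1.0])
b = np.array([0.0, -1.0, 0.5])

h = relu(W @ x + b)              # h = f(Wx + b)
print(h)                         # [1.  2.  1.5]
```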
Deep Q-Networks (DQN)
State Space & Q-Network
- Format:
- State: e.g., raw pixels or feature vectors.
- Actions: Discrete set.
- Q-Network:
- Input: State.
- Hidden: Fully connected or convolutional layers.
- Output: Q-values for each action.
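A minimal PyTorch sketch of the Q-network shape described above, assuming a small fully connected net (the state dimension, action count, and layer sizes are arbitrary choices):

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2      # assumed sizes for illustration

q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),    # output: one Q-value per discrete action
)

state = torch.randn(1, STATE_DIM)         # batch of one state
q_values = q_net(state)                   # shape (1, N_ACTIONS)
action = q_values.argmax(dim=1).item()    # greedy action
```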
Forward Passes & Training
- Forward Passes:
- 1 forward pass per action selection (online network) + 1 per TD-target computation (target network).
- Training:
- Once per environment step, using a mini-batch sampled from the replay buffer (see the buffer sketch below).
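A minimal replay-buffer sketch for the training step above (capacity and batch size are arbitrary choices):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniformly sample a mini-batch of transitions.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(list, zip(*batch))
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```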
DQN Problem Outline
- Define: State space, action space, reward function.
- NN Structure: e.g., Conv layers → Dense layers → Q-values.
Loss Function & TD Error
- L2 Loss:
$$ L = \frac{1}{2} \left( Q(s, a; \theta) - y \right)^2 $$
- $y$: TD target.
- TD Target: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$
- $\theta^-$: Target network parameters.
- TD Error: $\delta = y - Q(s, a; \theta)$
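A worked numeric example of the TD target, TD error, and L2 loss (the reward and Q-values below are made-up numbers standing in for network outputs):

```python
import numpy as np

gamma = 0.99
r = 1.0
q_sa = 2.0                             # Q(s, a; theta) from the online network
q_next = np.array([1.5, 3.0, 0.5])     # Q(s', .; theta^-) from the target network

y = r + gamma * q_next.max()           # TD target = 3.97
td_error = y - q_sa                    # 1.97
loss = 0.5 * (q_sa - y) ** 2           # ~1.94
print(y, td_error, loss)
```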
Hyperparameters
- Design Choices:
- Target network update frequency (e.g., every $C$ steps).
- Training frequency (e.g., every step).
- Replay buffer size.
- Mini-batch size.
- Learning rate, discount factor $\gamma$, exploration rate $\epsilon$.
Double DQN
- TD Target: $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big)$
- Purpose: Reduces overestimation.
- Why: Online network selects action; target network evaluates.
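A numeric sketch of the Double DQN target (the Q-values are made up, to show how decoupling selection from evaluation avoids the plain-DQN max):

```python
import numpy as np

gamma, r = 0.99, 1.0
q_next_online = np.array([1.5, 3.0, 0.5])   # Q(s', .; theta)   -> selects the action
q_next_target = np.array([1.0, 2.0, 4.0])   # Q(s', .; theta^-) -> evaluates it

a_star = q_next_online.argmax()                  # online net picks action 1
y_double = r + gamma * q_next_target[a_star]     # 1 + 0.99 * 2.0 = 2.98
y_dqn = r + gamma * q_next_target.max()          # plain DQN: 1 + 0.99 * 4.0 = 4.96
print(a_star, y_double, y_dqn)
```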
Dueling Networks
- Structure:
$$ Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|A|} \sum_{a'} A(s, a') \right) $$
- Memorize: Slide details (architecture, intuition).
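A minimal sketch of the dueling aggregation with made-up head outputs:

```python
import numpy as np

V = 2.0                          # state-value head (scalar)
A = np.array([1.0, -1.0, 0.0])   # advantage head, one entry per action

Q = V + (A - A.mean())           # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))
print(Q)                         # [3. 1. 2.]
```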
DDPG & TD3
Shared Concepts
- Gradient:
- Actor: Ascent on $\nabla_{\phi} J \approx \nabla_a Q(s, a; \theta)\big|_{a = \mu(s; \phi)} \cdot \nabla_{\phi}\, \mu(s; \phi)$.
- Critic: Descent on Q-function parameters.
- L2 Loss:
- Supervised:
$$ L = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2 $$
- DDPG:
$$ L = \frac{1}{2} (Q(s, a; \theta) - (r + \gamma Q(s', \mu(s'; \phi^-); \theta^-)))^2 $$
- TD Error:
- DDPG: $\delta = r + \gamma\, Q(s', \mu(s'; \phi^-); \theta^-) - Q(s, a; \theta)$.
- DQN: $\delta = r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)$.
- Exploration: DDPG uses noise (e.g., OU noise).
- Experience Replay: Buffer stores $(s, a, r, s')$.
DDPG
- Actor-Critic:
- Actor: $\mu(s; \phi)$ (deterministic policy).
- Critic: $Q(s, a; \theta)$.
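A minimal DDPG update sketch with tiny stand-in networks (the network sizes and batch data are assumptions; in practice the target networks are separate, slowly updated copies):

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 3, 1, 0.99

actor = nn.Sequential(nn.Linear(STATE_DIM, 16), nn.ReLU(),
                      nn.Linear(16, ACTION_DIM), nn.Tanh())           # mu(s; phi)
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 16), nn.ReLU(),
                       nn.Linear(16, 1))                              # Q(s, a; theta)
actor_target, critic_target = actor, critic   # stand-ins for the target copies

s = torch.randn(32, STATE_DIM)
a = torch.randn(32, ACTION_DIM)
r = torch.randn(32, 1)
s_next = torch.randn(32, STATE_DIM)

# Critic: descend the L2 loss toward y = r + gamma * Q(s', mu(s'; phi^-); theta^-).
with torch.no_grad():
    y = r + GAMMA * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
critic_loss = 0.5 * (critic(torch.cat([s, a], dim=1)) - y).pow(2).mean()

# Actor: ascend Q(s, mu(s; phi)), i.e. minimize its negative.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
```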
TD3
- Distinctions:
- Twin Q-networks ($Q_1$, $Q_2$); use $\min(Q_1, Q_2)$.
- Delayed policy/target updates.
- Target policy smoothing (noise on target actions).
- Target Updates:
- DQN: Hard/soft updates.
- TD3: Soft updates ($\tau \approx 0.005$), delayed policy updates.
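A sketch of the pieces that distinguish TD3, on stand-in tensors (the noise scale, clip range, and $\tau$ mirror common defaults and are assumptions here):

```python
import torch

GAMMA, NOISE_STD, NOISE_CLIP, TAU = 0.99, 0.2, 0.5, 0.005

def smoothed_target_action(a_target):
    # Target policy smoothing: add clipped noise to the target action.
    noise = (torch.randn_like(a_target) * NOISE_STD).clamp(-NOISE_CLIP, NOISE_CLIP)
    return (a_target + noise).clamp(-1.0, 1.0)

def td3_target(r, q1_next, q2_next):
    # Twin Q-networks: take the minimum of the two target critics.
    return r + GAMMA * torch.min(q1_next, q2_next)

def soft_update(net, target_net, tau=TAU):
    # Soft (Polyak) update: theta^- <- tau * theta + (1 - tau) * theta^-
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
```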
Quick Review Checklist
- Memorize:
- AlexNet architecture.
- L2 loss, TD error, and TD target formulas.
- Dueling Networks slide.
- Practice:
- Single-unit NN output.
- Convolution tracing.
- DQN/DDPG algorithm outlines.
- Understand:
- Double DQN’s $\max$ logic.
- TD3’s twin Q-functions.
- Gradient ascent intuition.