Reinforcement Learning: An Introduction - Midterm 2 Review
Reinforcement Learning Exam Study Sheet
Date: April 07, 2025
Textbook: Reinforcement Learning: An Introduction (2nd ed.) by Sutton & Barto
Focus: Midterm 1 Questions (Q3, Q4, Q5, Q6) + Deep Learning + DQN + DDPG/TD3
Midterm 1 Questions
Q3: Bandit Problem
- Setup:
- Q-values initialized to 0.
- Uses sample averaging: $Q(a) \leftarrow Q(a) + \frac{1}{n} (r - Q(a))$.
- Task:
- Complete a table tracking Q-values after each action and reward.
- Key Concept:
- Greedy: Pick action with highest Q-value from previous step.
- Non-Greedy: Explore other actions (e.g., via $\epsilon$-greedy).
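A minimal sketch of the sample-average update and greedy pick above (the actions and rewards in the trace are made up for illustration):

```python
import numpy as np

# 3-armed bandit, Q-values initialized to 0, incremental sample averaging.
Q = np.zeros(3)
N = np.zeros(3, dtype=int)  # per-action counts

def update(a, r):
    """Q(a) <- Q(a) + (1/n) * (r - Q(a))"""
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

# Made-up trace of (action, reward) pairs, as in the exam table.
for a, r in [(0, 1.0), (1, 2.0), (1, 0.0), (2, 3.0)]:
    update(a, r)
    print(f"a={a}, r={r}, Q={Q}")

greedy = int(np.argmax(Q))  # greedy action = argmax of current Q-values
```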
Q4: MDP with Exit States
- Setup:
- 3 exit states; must execute “exit” action to leave.
- Values after 1st time step: +2, +8, +1.
- Key Points:
- Optimal value of state A = $+8$.
- Value of state A becomes non-zero after the 3rd time step.
- Reaches optimal value by 4th time step (depends on $\gamma$, specified in question).
- Timeline:
- Time step 0: All values = 0.
- Time step 1: +2, +8, +1 appear.
Q5: Potential Repeat
- Note: Review Q5 from Exam 1 (details not provided—check notes).
- Task: Likely a conceptual or computational question; prepare for slight variation.
Q6: Solve for $y$ Based on $x$
- Setup:
- New, more challenging version than Exam 1.
- Task:
- Compute $y$ (e.g., NN output or Q-value) given $x$ (input or state).
- Tip: Practice function application from class examples.
Deep Learning (DL)
Single Unit Neural Network
- Output: $y = f(w \cdot x + b)$
- $f$: Activation function.
- $w$: Weight, $x$: Input, $b$: Bias.
- ReLU: $\text{ReLU}(z) = \max(0, z)$.
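A tiny numeric check of the single-unit formula (the weights, input, and bias are arbitrary numbers):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

w = np.array([0.5, -1.0, 2.0])   # weights (arbitrary)
x = np.array([1.0, 2.0, 0.5])    # input (arbitrary)
b = 0.1                          # bias

y = relu(np.dot(w, x) + b)       # y = f(w . x + b) with f = ReLU
print(y)                         # max(0, -0.4) = 0.0
```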
Convolutional Layer
- Mechanics: Filters slide over input to produce feature maps.
- Input/Output Dimensions:
- Input: $x$ units; Output: $y$ units.
- Understand effect of filter size, stride, padding (no formula memorization).
- Key Diagram:
- Study convolution picture (expect variation on exam).
- Focus: Filter application, stride, output shape.
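A small sketch for tracing filter application and output shape in 1-D (the input, filter, and stride are arbitrary; "valid" means no padding):

```python
import numpy as np

x = np.array([1, 2, 0, -1, 3, 1], dtype=float)   # input of length 6
f = np.array([1, 0, -1], dtype=float)            # filter of size 3
stride = 1

out_len = (len(x) - len(f)) // stride + 1        # valid-convolution output length
out = np.array([np.dot(x[i * stride : i * stride + len(f)], f) for i in range(out_len)])
print(out_len, out)                              # 4 [ 1.  3. -3. -2.]
```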
AlexNet
- Memorize:
- Architecture: Layers, filter sizes, strides, pooling.
- Details from slide.
One-Layer Evaluation
- Vector Activation: $h = f(Wx + b)$
- $W$: Weight matrix, $x$: Input vector, $b$: Bias vector, $f$: Activation (e.g., ReLU).
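A minimal sketch of one-layer evaluation with a ReLU activation (the matrix, input, and bias values are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

W = np.array([[1.0, -1.0],
              [0.5,  2.0],
              [0.0,  1.0]])      # 3 units, 2 inputs
x = np.array([2.0, 1.0])
b = np.array([0.0, -1.0, 0.5])

h = relu(W @ x + b)              # h = f(Wx + b)
print(h)                         # [1.  2.  1.5]
```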
Deep Q-Networks (DQN)
State Space & Q-Network
- Format:
- State: e.g., raw pixels or feature vectors.
- Actions: Discrete set.
- Q-Network:
- Input: State.
- Hidden: Fully connected or convolutional layers.
- Output: Q-values for each action.
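A minimal PyTorch sketch of the Q-network shape described above, assuming a small fully connected net (the state dimension, action count, and layer sizes are arbitrary choices):

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2      # assumed sizes for illustration

q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),    # output: one Q-value per discrete action
)

state = torch.randn(1, STATE_DIM)         # batch of one state
q_values = q_net(state)                   # shape (1, N_ACTIONS)
action = q_values.argmax(dim=1).item()    # greedy action
```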
Forward Passes & Training
- Forward Passes:
- 1 forward pass per action selection (online network) + 1 per TD-target computation (target network).
- Training:
- Once per environment step, using a mini-batch sampled from the replay buffer (see the buffer sketch below).
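A minimal replay-buffer sketch for the training step above (capacity and batch size are arbitrary choices):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniformly sample a mini-batch of transitions.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(list, zip(*batch))
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```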
DQN Problem Outline
- Define: State space, action space, reward function.
- NN Structure: e.g., Conv layers → Dense layers → Q-values.
Loss Function & TD Error
- L2 Loss:
$$ L = \frac{1}{2} \left( Q(s, a; \theta) - y \right)^2 $$
- $y$: TD target.
- TD Target: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$
- $\theta^-$: Target network parameters.
- TD Error: $\delta = y - Q(s, a; \theta)$
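A worked numeric example of the TD target, TD error, and L2 loss (the reward and Q-values below are made-up numbers standing in for network outputs):

```python
import numpy as np

gamma = 0.99
r = 1.0
q_sa = 2.0                             # Q(s, a; theta) from the online network
q_next = np.array([1.5, 3.0, 0.5])     # Q(s', .; theta^-) from the target network

y = r + gamma * q_next.max()           # TD target = 3.97
td_error = y - q_sa                    # 1.97
loss = 0.5 * (q_sa - y) ** 2           # ~1.94
print(y, td_error, loss)
```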
Hyperparameters
- Design Choices:
- Target network update frequency (e.g., every $C$ steps).
- Training frequency (e.g., every step).
- Replay buffer size.
- Mini-batch size.
- Learning rate, discount factor $\gamma$, exploration rate $\epsilon$.
Double DQN
- TD Target: $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big)$
- Purpose: Reduces overestimation.
- Why: Online network selects action; target network evaluates.
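A numeric sketch of the Double DQN target (the Q-values are made up, to show how decoupling selection from evaluation avoids the plain-DQN max):

```python
import numpy as np

gamma, r = 0.99, 1.0
q_next_online = np.array([1.5, 3.0, 0.5])   # Q(s', .; theta)   -> selects the action
q_next_target = np.array([1.0, 2.0, 4.0])   # Q(s', .; theta^-) -> evaluates it

a_star = q_next_online.argmax()                  # online net picks action 1
y_double = r + gamma * q_next_target[a_star]     # 1 + 0.99 * 2.0 = 2.98
y_dqn = r + gamma * q_next_target.max()          # plain DQN: 1 + 0.99 * 4.0 = 4.96
print(a_star, y_double, y_dqn)
```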
Dueling Networks
- Structure:
$$ Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|A|} \sum_{a'} A(s, a') \right) $$
- Memorize: Slide details (architecture, intuition).
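A minimal sketch of the dueling aggregation with made-up head outputs:

```python
import numpy as np

V = 2.0                          # state-value head (scalar)
A = np.array([1.0, -1.0, 0.0])   # advantage head, one entry per action

Q = V + (A - A.mean())           # Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))
print(Q)                         # [3. 1. 2.]
```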
DDPG & TD3
Shared Concepts
- Gradient:
- Actor: Ascent on $\nabla_{\phi} J \approx \nabla_a Q(s, a; \theta)\big|_{a = \mu(s; \phi)} \cdot \nabla_{\phi}\, \mu(s; \phi)$.
- Critic: Descent on Q-function parameters.
- L2 Loss:
- Supervised:
$$ L = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2 $$
- DDPG:
$$ L = \frac{1}{2} (Q(s, a; \theta) - (r + \gamma Q(s', \mu(s'; \phi^-); \theta^-)))^2 $$
- TD Error:
- DDPG: $\delta = r + \gamma\, Q(s', \mu(s'; \phi^-); \theta^-) - Q(s, a; \theta)$.
- DQN: $\delta = r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)$.
- Exploration: DDPG uses noise (e.g., OU noise).
- Experience Replay: Buffer stores $(s, a, r, s')$.
DDPG
- Actor-Critic:
- Actor: $\mu(s; \phi)$ (deterministic policy).
- Critic: $Q(s, a; \theta)$.
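A minimal DDPG update sketch with tiny stand-in networks (the network sizes and batch data are assumptions; in practice the target networks are separate, slowly updated copies):

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 3, 1, 0.99

actor = nn.Sequential(nn.Linear(STATE_DIM, 16), nn.ReLU(),
                      nn.Linear(16, ACTION_DIM), nn.Tanh())           # mu(s; phi)
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 16), nn.ReLU(),
                       nn.Linear(16, 1))                              # Q(s, a; theta)
actor_target, critic_target = actor, critic   # stand-ins for the target copies

s = torch.randn(32, STATE_DIM)
a = torch.randn(32, ACTION_DIM)
r = torch.randn(32, 1)
s_next = torch.randn(32, STATE_DIM)

# Critic: descend the L2 loss toward y = r + gamma * Q(s', mu(s'; phi^-); theta^-).
with torch.no_grad():
    y = r + GAMMA * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
critic_loss = 0.5 * (critic(torch.cat([s, a], dim=1)) - y).pow(2).mean()

# Actor: ascend Q(s, mu(s; phi)), i.e. minimize its negative.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
```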
TD3
- Distinctions:
- Twin Q-networks ($Q_1$, $Q_2$); use $\min(Q_1, Q_2)$.
- Delayed policy/target updates.
- Target policy smoothing (noise on target actions).
- Target Updates:
- DQN: Hard/soft updates.
- TD3: Soft updates ($\tau \approx 0.005$), delayed policy updates.
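A sketch of the pieces that distinguish TD3, on stand-in tensors (the noise scale, clip range, and $\tau$ mirror common defaults and are assumptions here):

```python
import torch

GAMMA, NOISE_STD, NOISE_CLIP, TAU = 0.99, 0.2, 0.5, 0.005

def smoothed_target_action(a_target):
    # Target policy smoothing: add clipped noise to the target action.
    noise = (torch.randn_like(a_target) * NOISE_STD).clamp(-NOISE_CLIP, NOISE_CLIP)
    return (a_target + noise).clamp(-1.0, 1.0)

def td3_target(r, q1_next, q2_next):
    # Twin Q-networks: take the minimum of the two target critics.
    return r + GAMMA * torch.min(q1_next, q2_next)

def soft_update(net, target_net, tau=TAU):
    # Soft (Polyak) update: theta^- <- tau * theta + (1 - tau) * theta^-
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
```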
Quick Review Checklist
- Memorize:
- AlexNet architecture.
- L2 loss, TD error, and TD target formulas.
- Dueling Networks slide.
- Practice:
- Single-unit NN output.
- Convolution tracing.
- DQN/DDPG algorithm outlines.
- Understand:
- Double DQN’s $\max$ logic.
- TD3’s twin Q-functions.
- Gradient ascent intuition.