Module 4: Reinforcement Learning
Temporal Difference Learning
Blend sampling and bootstrapping into practical value learning.
Why this module matters
TD methods sit at the center of classic RL: they learn from incomplete episodes, and they show how bootstrapping trades variance for bias, which changes both the stability and the speed of value learning.
Prerequisites
- ▸ Monte Carlo basics
Learning objectives
- ▸ Compare TD(0), SARSA, and Q-learning
- ▸ Understand bootstrap targets
- ▸ Reason about on-policy vs off-policy
Core concepts
Bootstrapping: updating a value estimate toward a target that itself depends on current estimates, rather than waiting for a complete return.
TD target: the one-step target r + γV(s') used in place of the full Monte Carlo return.
On-policy vs off-policy: whether the update target follows the policy that generates behavior (SARSA) or a different, typically greedy, policy (Q-learning).
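The concepts above can be sketched in a few lines of tabular TD(0). The chain environment below is a toy invented for illustration (not part of the module's exercises); the point is the update line, where the target bootstraps on the current estimate of the next state and terminal states contribute zero:

```python
import random

random.seed(0)

GAMMA = 0.9   # discount factor
ALPHA = 0.1   # step size

def step(state):
    """Toy 5-state chain: move right with prob 0.9, left with prob 0.1.
    Reaching state 4 is terminal and yields reward 1."""
    nxt = min(state + 1, 4) if random.random() < 0.9 else max(state - 1, 0)
    reward = 1.0 if nxt == 4 else 0.0
    return nxt, reward, nxt == 4

V = [0.0] * 5  # one value estimate per state

for _ in range(2000):
    s, done = 0, False
    while not done:
        s2, r, done = step(s)
        # TD target: bootstrap on V[s2]; a terminal state contributes 0.
        target = r + (0.0 if done else GAMMA * V[s2])
        V[s] += ALPHA * (target - V[s])  # TD(0) update
        s = s2
```

States closer to the goal end up with higher estimated value, reflecting the discounting in the TD target.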
Hands-on practice
- ▸ Train SARSA and Q-learning on Taxi-v3
Expected output
A comparison of learning curves and policy behavior.
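If Taxi-v3 (and Gymnasium) is not set up yet, the SARSA/Q-learning distinction can still be rehearsed on a stand-in environment. The sketch below uses a hypothetical 6-cell corridor in place of Taxi-v3; only the target line differs between the two algorithms:

```python
import random

random.seed(0)

# Illustrative stand-in for Taxi-v3: a corridor of 6 cells, start at 0,
# +1 reward for reaching cell 5 (terminal). All names are assumptions.
N, GOAL = 6, 5
ACTIONS = (-1, +1)  # left, right
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1

def step(s, a):
    s2 = min(max(s + ACTIONS[a], 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[s][a])

def train(off_policy, episodes=500):
    Q = [[0.0] * len(ACTIONS) for _ in range(N)]
    for _ in range(episodes):
        s, done = 0, False
        a = eps_greedy(Q, s)
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2)
            if done:
                target = r                       # no bootstrap at terminal
            elif off_policy:
                target = r + GAMMA * max(Q[s2])  # Q-learning: greedy target
            else:
                target = r + GAMMA * Q[s2][a2]   # SARSA: action actually taken
            Q[s][a] += ALPHA * (target - Q[s][a])
            s, a = s2, a2
    return Q

q_sarsa = train(off_policy=False)
q_qlearn = train(off_policy=True)
```

Swapping `step` for `env.step` on a Gymnasium environment turns this into the Taxi-v3 exercise; the update logic is unchanged.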
Study checklist
- ✅ Compare TD(0), SARSA, and Q-learning
- ✅ Understand bootstrap targets
- ✅ Reason about on-policy vs off-policy
Common mistakes
- ⚠️ Not handling terminal states correctly
- ⚠️ Confusing the policy used for acting with the policy implied by the update target (the SARSA vs Q-learning distinction)
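The first mistake above is worth seeing in code. Below is a hedged sketch (function names are hypothetical) contrasting a buggy target that bootstraps past the end of an episode with the corrected one:

```python
def td_target_buggy(r, q_next, done, gamma=0.99):
    # Bug: bootstraps from q_next even when the episode has ended,
    # leaking value "past" the terminal state. `done` is ignored.
    return r + gamma * q_next

def td_target_fixed(r, q_next, done, gamma=0.99):
    # Terminal transitions must not bootstrap: the return after the
    # end of an episode is zero by definition.
    return r + (0.0 if done else gamma * q_next)
```

The bug is easy to miss because both versions agree on every non-terminal transition; it only corrupts the values of states adjacent to episode ends, which then propagate backward.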
Module rhythm
- 1. Read the summary and why-it-matters section first.
- 2. Work through concepts before rushing into practice.
- 3. Use the checklist to verify real understanding, not just completion.
How to use this page well
Treat each module as a compact learning system: understand the intuition, verify the concepts, do one hands-on task, then use the checklist and mistakes section to pressure-test your understanding.