Module 4: Reinforcement Learning

Temporal Difference Learning

Blend sampling and bootstrapping into practical value learning.

Why this module matters

TD methods sit at the center of classical RL; they show how bootstrapping trades bias for variance, which changes both the stability and the speed of value learning.

Prerequisites

  • Monte Carlo basics

Learning objectives

  • Compare TD(0), SARSA, and Q-learning
  • Understand bootstrap targets
  • Reason about on-policy vs off-policy

Core concepts

  • Bootstrapping: updating an estimate from other current estimates instead of waiting for a full return
  • TD target: the one-step bootstrapped target r + γ·V(s′) that TD(0) moves its estimate toward
  • On-policy vs off-policy: whether the policy being improved is the same one generating the data
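The bootstrapped TD target can be made concrete with a minimal sketch. The state count, learning rate, and transition below are made-up placeholders, not part of any particular environment:

```python
import numpy as np

gamma, alpha = 0.99, 0.1          # discount factor, learning rate (illustrative values)
V = np.zeros(16)                  # tabular value estimates for a hypothetical 16-state task

def td0_update(s, r, s_next, done):
    # Bootstrap: the target uses the current estimate V[s_next],
    # not a full Monte Carlo return; at a terminal step the
    # bootstrap term is dropped entirely.
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

td0_update(0, 1.0, 1, False)      # one-step update toward the bootstrapped target
```

Note how a single transition suffices for an update, which is exactly what distinguishes TD(0) from Monte Carlo methods that must wait for an episode to end.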

Hands-on practice

  • Train SARSA and Q-learning on Taxi-v3
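Before wiring either algorithm into Taxi-v3 (via gymnasium's `env.step()`), it helps to see the two targets side by side on a single hand-made transition. The table sizes and values below are toy assumptions for illustration:

```python
import numpy as np

gamma, alpha = 0.99, 0.5          # illustrative hyperparameters
Q = np.zeros((5, 2))              # toy table: 5 states, 2 actions
Q[1] = [0.0, 2.0]                 # pretend the next state already has value estimates

s, a, r, s_next = 0, 0, 1.0, 1
a_next = 0                        # action the behavior policy actually takes next

# SARSA (on-policy): bootstrap on the action actually taken next.
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-learning (off-policy): bootstrap on the greedy action,
# regardless of what the behavior policy does.
q_target = r + gamma * Q[s_next].max()

print(sarsa_target, q_target)     # here the greedy bootstrap gives a larger target
```

The only difference is the bootstrap term, yet it is what makes SARSA on-policy and Q-learning off-policy; the same contrast drives their different learning curves on Taxi-v3.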

Expected output

A comparison of learning curves and policy behavior.

Study checklist

  • Compare TD(0), SARSA, and Q-learning
  • Understand bootstrap targets
  • Reason about on-policy vs off-policy

Common mistakes

  • ⚠️ Not zeroing the bootstrap term at terminal states
  • ⚠️ Confusing the policy used for acting with the policy used in the update target
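The terminal-state mistake is worth seeing in code. A minimal sketch, with made-up table sizes and values:

```python
import numpy as np

gamma = 0.99
Q = np.full((3, 2), 5.0)          # toy Q-table with optimistic initial values

def q_target(r, s_next, done):
    # At a terminal transition the bootstrap term must be dropped;
    # otherwise the (meaningless) value stored for the terminal
    # state leaks into the target and inflates it.
    return r if done else r + gamma * Q[s_next].max()

q_target(1.0, 2, done=True)       # terminal: the target is just the reward
```

Forgetting the `done` branch is a classic source of value estimates that never converge, since every terminal reward gets an extra phantom `gamma * Q` bonus.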

Module rhythm

  • 1. Read the summary and why-it-matters section first.
  • 2. Work through concepts before rushing into practice.
  • 3. Use the checklist to verify real understanding, not just completion.

How to continue

Next, replace tabular value functions with neural network approximators via DQN.

Back to course overview →

How to use this page well

Treat each module as a compact learning system: understand the intuition, verify the concepts, do one hands-on task, then use the checklist and mistakes section to pressure-test your understanding.