Module 6: Reinforcement Learning
Policy Gradients and REINFORCE
Optimize policies directly and feel the pain of variance firsthand.
Why this module matters
Policy gradients open the door to continuous control and modern actor-critic methods.
Prerequisites
- ▸ Probability basics
- ▸ MDP intuition
Learning objectives
- ▸ Derive the score-function estimator
- ▸ Understand baseline variance reduction
- ▸ Train REINFORCE on a simple environment
Core concepts
Log-derivative trick: rewrite the gradient of an expected return as an expectation over the policy's score function, so it can be estimated from sampled trajectories.
Return-weighted gradients: weight each action's score by the return that followed it, yielding the REINFORCE estimator.
Variance reduction: subtract a baseline from the return to shrink the estimator's variance without introducing bias.
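For reference, the derivation these three concepts chain together, in standard REINFORCE notation ($\tau$ is a trajectory, $R(\tau)$ its return, $G_t$ the return-to-go from step $t$):

```latex
% Log-derivative trick: the gradient of an expectation becomes an
% expectation of the score function, estimable by sampling.
\nabla_\theta J(\theta)
  = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]

% Per-step estimator with a baseline b(s_t). Subtracting b(s_t) leaves the
% estimator unbiased because E_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0.
\nabla_\theta J(\theta)
  = \mathbb{E}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr) \right]
```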
Hands-on practice
- ▸ Train REINFORCE with and without baseline
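A minimal sketch of that exercise, assuming gymnasium and torch are installed. CartPole-v1, the two-layer policy, the running-mean baseline, and all hyperparameters are illustrative choices, not prescribed by the module:

```python
# REINFORCE on CartPole-v1, with and without a running-mean baseline.
import gymnasium as gym
import torch
import torch.nn as nn

def train(use_baseline: bool, episodes: int = 300, gamma: float = 0.99, seed: int = 0):
    env = gym.make("CartPole-v1")
    torch.manual_seed(seed)
    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
    baseline = 0.0  # running mean of returns; a crude stand-in for a value function

    for ep in range(episodes):
        obs, _ = env.reset(seed=seed + ep)
        log_probs, rewards = [], []
        done = False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated

        # Discounted return-to-go G_t for each step of the episode.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        returns = torch.tensor(returns, dtype=torch.float32)

        if use_baseline:
            baseline = 0.9 * baseline + 0.1 * returns.mean().item()
            advantages = returns - baseline
        else:
            advantages = returns

        # Score-function estimator: ascend E[log pi(a|s) * advantage].
        loss = -(torch.stack(log_probs) * advantages).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

        if (ep + 1) % 50 == 0:
            print(f"baseline={use_baseline} ep={ep + 1} return={sum(rewards):.0f}")
    env.close()

if __name__ == "__main__":
    train(use_baseline=False)
    train(use_baseline=True)
```

Running both configurations and comparing the printed episode returns is exactly the experiment described under "Expected output" below.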
Expected output
A pair of learning curves from the same environment; the run without a baseline is typically visibly noisier, showing concretely why variance reduction matters.
Study checklist
- ✅ Derive the score-function estimator
- ✅ Understand baseline variance reduction
- ✅ Train REINFORCE on a simple environment
Common mistakes
- ⚠️ Assuming policy gradients are automatically stable: raw REINFORCE gradients are high-variance, and training can stall or diverge without variance reduction and a carefully chosen learning rate.
- ⚠️ Ignoring reward scaling: large or unnormalized returns inflate gradient magnitudes (see the normalization snippet below).
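A common fix for the reward-scaling pitfall is to standardize returns per episode. A minimal sketch; the raw return values here are purely illustrative:

```python
import torch

# Standardizing returns-to-go bounds gradient magnitudes regardless of the
# environment's raw reward scale.
returns = torch.tensor([1.0, 5.0, 20.0, 100.0])  # illustrative raw returns
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
```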
Module rhythm
1. Read the summary and why-it-matters section first.
2. Work through concepts before rushing into practice.
3. Use the checklist to verify real understanding, not just completion.
How to use this page well
Treat each module as a compact learning system: understand the intuition, verify the concepts, do one hands-on task, then use the checklist and mistakes section to pressure-test your understanding.