Module 6: Reinforcement Learning

Policy Gradients and REINFORCE

Optimize policies directly and feel the pain of variance firsthand.

Why this module matters

Policy gradients open the door to continuous control and modern actor-critic methods.

Prerequisites

  • Probability basics
  • MDP intuition

Learning objectives

  • Derive the score-function estimator
  • Understand baseline variance reduction
  • Train REINFORCE on a simple environment

Core concepts

  • Log-derivative trick: rewrite the gradient of an expectation as an expectation of a score-weighted gradient, so it can be estimated from samples.
  • Return-weighted gradients: weight each action's log-probability gradient by the return that followed it.
  • Variance reduction: subtract a baseline from the return to shrink the estimator's variance without introducing bias.
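
These three ideas chain into a single derivation. Here is a sketch in standard notation, assuming τ denotes a trajectory sampled from π_θ, G_t the discounted return from step t, and b(s_t) a baseline (the module itself does not fix this notation):

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
   = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\big]
   && \text{log-derivative trick} \\
  &= \mathbb{E}\Big[\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]
   && \text{return-weighted gradients} \\
  &= \mathbb{E}\Big[\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\Big]
   && \text{baseline}
\end{aligned}
```

The last step is unbiased because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$ for any baseline that does not depend on the action.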

Hands-on practice

  • Train REINFORCE with and without a baseline (a minimal sketch follows below)
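
A minimal sketch of that experiment, assuming gymnasium's CartPole-v1 and PyTorch (the module does not prescribe either; any discrete-action environment would do). Flip USE_BASELINE to run both arms of the comparison:

```python
# REINFORCE sketch: episodic policy gradient with an optional baseline.
# Assumptions: gymnasium + PyTorch installed; CartPole-v1 as the environment.
import gymnasium as gym
import torch
import torch.nn as nn

USE_BASELINE = True
GAMMA = 0.99

env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.Tanh(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
baseline = 0.0  # running mean of returns; a crude constant baseline

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return G_t for every step of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)

    if USE_BASELINE:
        baseline = 0.9 * baseline + 0.1 * returns.mean().item()
        advantages = returns - baseline
    else:
        advantages = returns

    # Score-function loss: -(sum_t log pi(a_t|s_t) * advantage_t).
    loss = -(torch.stack(log_probs) * advantages).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if episode % 50 == 0:
        print(f"episode {episode:4d}  return {sum(rewards):6.1f}")
```

The running-mean baseline is a simplification: it is constant across states, which is enough to see the variance effect; a state-dependent baseline (a value function) is where the next module picks up.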

Expected output

A learning-curve comparison: with the baseline, episode returns should climb faster and fluctuate less from run to run; without it, training is noticeably noisier.
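
To see the variance effect in isolation from the training loop, a toy check on a two-armed bandit is enough. The setup below is illustrative and not taken from the module (arm rewards of 1 and 3 plus noise, policy π(arm 1) = sigmoid(θ)); it draws many score-function gradient samples and compares their spread with and without a mean baseline:

```python
# Toy variance check for the score-function estimator on a 2-armed bandit.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
p1 = 1.0 / (1.0 + np.exp(-theta))  # probability of pulling arm 1

def grad_samples(baseline, n=100_000):
    arm1 = rng.random(n) < p1                       # sampled actions
    r = np.where(arm1, 3.0, 1.0) + rng.normal(0.0, 0.5, n)
    # d/dtheta log pi(a): (1 - p1) if arm 1 was pulled, else -p1.
    score = np.where(arm1, 1.0 - p1, -p1)
    return score * (r - baseline)

plain = grad_samples(baseline=0.0)
with_b = grad_samples(baseline=2.0)  # mean arm reward as baseline
print(f"mean: {plain.mean():+.3f} vs {with_b.mean():+.3f}  (true value +0.500)")
print(f"var:  {plain.var():.3f} vs {with_b.var():.3f}")
```

Both estimators agree in expectation (the true gradient here is σ(θ)(1−σ(θ))(3−1) = 0.5); only their variance differs, which is the whole point of the baseline.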

Study checklist

  • I can derive the score-function (log-derivative) estimator from scratch
  • I can explain why subtracting a baseline reduces variance without adding bias
  • I have trained REINFORCE on a simple environment, with and without a baseline

Common mistakes

  • ⚠️ Assuming policy gradients are automatically stable: the estimator is unbiased but high-variance, so runs can diverge without a baseline, normalization, or careful learning rates
  • ⚠️ Ignoring reward scaling: return magnitude multiplies the gradient directly, so unscaled rewards make the learning rate environment-dependent
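
A common mitigation for the scaling problem, shown here against the hypothetical `returns` tensor from the sketch above, is per-episode normalization, which doubles as a crude baseline:

```python
# Normalize returns to zero mean and unit variance within the episode.
# The epsilon avoids division by zero when all returns are equal.
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
```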

Module rhythm

  1. Read the summary and why-it-matters section first.
  2. Work through concepts before rushing into practice.
  3. Use the checklist to verify real understanding, not just completion.

How to continue

Next, replace the crude baseline with a learned critic and move into actor-critic systems.


How to use this page well

Treat each module as a compact learning system: understand the intuition, verify the concepts, do one hands-on task, then use the checklist and mistakes section to pressure-test your understanding.