🎮
Intermediate

Reinforcement Learning

RL is where many smart engineers get lost because it mixes probability, optimization, function approximation, and systems issues all at once. This course breaks the field into understandable steps, then builds those steps back up into scratch implementations and Tianshou workflows.

~34 hours · 📦 11 modules · 🐍 PyTorch 2.x · Gymnasium · Tianshou

How beginners should use this course

  • Start with tabular environments. They are not toys; they are microscopes.
  • Plot everything: reward, episode length, value estimates, exploration rate, and losses.
  • Keep your scratch implementations even after switching to Tianshou. They are your debugging reference.
  • Do not trust any RL result you cannot reproduce across seeds and reruns.

Mathematical Foundations

Return and discounting

RL optimizes expected future return, not immediate reward alone.

Discounting captures the intuition that near-term outcomes usually matter more or are more certain.

Once this clicks, most value-function equations become easier to interpret.
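
As a concrete anchor, the discounted return is one backward pass over an episode's rewards. A minimal Python sketch (the reward list and gamma here are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one trajectory."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
    return g

# Near-term rewards dominate: with gamma = 0.5, a reward two steps away
# contributes only a quarter of its raw value.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```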

Bellman recursion

Bellman equations decompose a long-horizon problem into one-step reward plus future value.

That recursive structure unifies dynamic programming, TD, Q-learning, and PPO thinking.

RL feels fragmented until you see this shared backbone clearly.
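
To make the recursion tangible, here is a one-screen value-iteration sketch on a hypothetical two-state MDP (states, rewards, and gamma are invented for illustration):

```python
gamma = 0.9
# transitions[s][a] = (reward, next_state); next_state None marks terminal.
transitions = {
    0: {"left": (0.0, 1), "right": (2.0, 1)},
    1: {"exit": (10.0, None)},
}

def bellman_backup(V, transitions, gamma):
    """One sweep of V(s) <- max_a [ r + gamma * V(s') ]."""
    new_V = {}
    for s, acts in transitions.items():
        new_V[s] = max(
            r + (gamma * V[ns] if ns is not None else 0.0)
            for r, ns in acts.values()
        )
    return new_V

V = {0: 0.0, 1: 0.0}
for _ in range(50):  # repeated one-step backups converge to optimal values
    V = bellman_backup(V, transitions, gamma)
print(V)  # {0: 11.0, 1: 10.0}: "right" earns 2 now plus discounted 10 later
```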

Bias-variance tradeoffs everywhere

Monte Carlo, TD, GAE, replay buffers, and target networks all manage different bias-variance tradeoffs.

A lot of RL engineering amounts to accepting a tolerable amount of bias in exchange for variance low enough to train on.

Once learners see that, the field stops looking random and starts looking structured.
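
One way to see the tradeoff concretely: an n-step target uses real rewards for n steps and then bootstraps from a value estimate, interpolating between TD(0) and Monte Carlo. A small sketch (the rewards and bootstrap value are made up):

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step TD target: real rewards for n steps, then a bootstrapped tail.
    One reward plus a tail is TD(0)-style (more bias, less variance); a full
    episode with a zero tail is Monte Carlo (no bias, more variance)."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

rewards = [1.0, 1.0, 1.0, 1.0]
v_tail = 5.0  # a (possibly wrong) value estimate for the tail state
print(n_step_target(rewards[:1], v_tail))  # TD(0)-style: 1 + 0.99 * 5 = 5.95
print(n_step_target(rewards, 0.0))         # Monte Carlo-style, tail ignored
```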

Detailed Modules

01

MDPs and Bellman Thinking

Learn the formal language of RL before touching algorithms.

You will learn

  • What states, actions, rewards, transitions, and return actually mean
  • Why Bellman equations sit at the center of RL reasoning
  • How discounting changes optimization goals

Hands-on practice

Model a toy GridWorld as an MDP and write out returns for several trajectories.

Expected output

A short MDP worksheet with Bellman updates done by hand.
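
A sketch of what such a worksheet might code up, assuming a hypothetical 4x4 grid with a goal at (3, 3), -1 reward per step, and a deterministic clamp-at-walls transition:

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, size=4, goal=(3, 3)):
    """Deterministic transition: move, clamp at walls, return (s', r, done)."""
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), size - 1)
    c = min(max(state[1] + dc, 0), size - 1)
    next_state = (r, c)
    if next_state == goal:
        return next_state, 0.0, True   # entering the goal ends the episode
    return next_state, -1.0, False

# Roll out one trajectory and sum its rewards (gamma = 1 here).
s, total, done = (0, 0), 0.0, False
for a in ["down", "down", "down", "right", "right", "right"]:
    s, rew, done = step(s, a)
    total += rew
print(s, total, done)  # (3, 3) -5.0 True: five -1 steps, then the goal
```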

02

Dynamic Programming

Solve small environments exactly so value iteration intuition becomes concrete.

You will learn

  • Difference between policy evaluation and policy improvement
  • How value iteration converges to an optimal policy
  • Why tabular exact methods matter even if they do not scale

Hands-on practice

Implement policy iteration and value iteration on GridWorld.

Expected output

A working tabular solver plus value heatmaps.
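
For reference, value iteration on a grid of this kind fits in a few lines. This sketch assumes a hypothetical 4x4 layout with -1 per move (including the move into the goal) and gamma = 1, so values converge to minus the step distance to the goal:

```python
SIZE, GOAL = 4, (3, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def value_iteration(theta=1e-8):
    """Sweep V(s) <- max_a [ -1 + V(s') ] until values stop changing."""
    V = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for s in V:
            if s == GOAL:
                continue                      # terminal state stays at 0
            best = max(
                -1.0 + V[(min(max(s[0] + dr, 0), SIZE - 1),
                          min(max(s[1] + dc, 0), SIZE - 1))]
                for dr, dc in MOVES
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best                       # in-place (Gauss-Seidel) update
        if delta < theta:
            return V

V = value_iteration()
print(V[(0, 0)])  # -6.0: six steps from the far corner to the goal
```

Printing V as a 4x4 grid is exactly the value heatmap the module asks for.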

03

Monte Carlo Methods

Estimate value from full episodes without bootstrapping.

You will learn

  • How first-visit and every-visit Monte Carlo differ
  • Why Monte Carlo can be unbiased but high variance
  • How importance sampling enters off-policy correction

Hands-on practice

Estimate state values from sampled episodes and compare variance across seeds.

Expected output

A notebook comparing Monte Carlo estimates against exact tabular solutions.
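
A minimal first-visit Monte Carlo sketch on made-up episodes (each episode is a list of (state, reward) pairs; the data is illustrative):

```python
def first_visit_mc(episodes, gamma=1.0):
    """First-visit MC: average the return seen the first time a state appears."""
    totals, counts = {}, {}
    for episode in episodes:              # episode = [(state, reward), ...]
        g, first_returns = 0.0, {}
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_returns[state] = g      # later writes keep the EARLIEST visit
        for state, g0 in first_returns.items():
            totals[state] = totals.get(state, 0.0) + g0
            counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

# Two made-up episodes through A -> B with different terminal rewards:
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 3.0)],
]
print(first_visit_mc(episodes))  # A and B both average to 2.0
```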

04

Temporal Difference Learning

Bridge Monte Carlo and dynamic programming with bootstrapped learning.

You will learn

  • How TD(0), SARSA, and Q-learning differ
  • Why off-policy Q-learning can learn optimal behavior from exploratory data
  • How done masks and bootstrapping interact

Hands-on practice

Train TD variants on Taxi-v3 and compare convergence speed.

Expected output

A benchmark plot showing reward curves for SARSA vs Q-learning.
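
The two update rules differ only in which next-action value they bootstrap from; a toy sketch with invented states and Q-values:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the GREEDY next action."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action the policy ACTUALLY took."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 4.0, "right": -2.0}}
q_learning_update(Q, "s0", "right", 1.0, "s1")
print(round(Q["s0"]["right"], 3))  # 0.496 = 0.1 * (1 + 0.99 * 4)
```

If an exploratory policy actually picks "right" in s1, SARSA bootstraps from -2.0 instead of 4.0, which is exactly why the two methods learn different things from the same data.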

05

Deep Q-Networks

Move from tables to neural approximators without losing algorithmic clarity.

You will learn

  • Why replay buffers and target networks stabilize learning
  • How ε-greedy exploration influences data quality
  • Why Atari preprocessing is part of the algorithm

Hands-on practice

Build a DQN agent for Pong or CartPole with replay and target updates.

Expected output

A scratch DQN implementation with reward and Q-value stability plots.
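
The structural pieces can be sketched without any neural network: a uniform replay buffer plus a periodically synced target, here with a single stand-in weight per "network" (all numbers are placeholders, not a real DQN):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform replay buffer: sampling old transitions decorrelates updates."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)   # oldest transitions fall off

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)

# Stand-in "networks": one weight each, just to show the target-sync rhythm.
q_online, q_target = {"w": 0.0}, {"w": 0.0}
SYNC_EVERY = 100
buffer = ReplayBuffer()
for step in range(1, 451):
    buffer.push((step, 0, 0.0, step + 1, False))   # fake (s, a, r, s', done)
    if len(buffer) >= 32:
        batch = buffer.sample(32)          # a TD loss would use q_target here
        q_online["w"] += 0.01              # pretend gradient step
    if step % SYNC_EVERY == 0:
        q_target = dict(q_online)          # hard sync: target <- online

print(len(buffer), round(q_online["w"] - q_target["w"], 2))  # 450 0.5
```

The point of the lag: the bootstrap target stays fixed between syncs, so the online network is not chasing a moving target on every update.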

06

Policy Gradient and REINFORCE

Optimize policies directly and understand the variance pain that comes with it.

You will learn

  • Where the log-derivative trick comes from
  • Why policy gradients are unbiased but noisy
  • How baselines reduce variance without bias

Hands-on practice

Train REINFORCE on CartPole and compare raw returns with baseline-corrected returns.

Expected output

A small experiment showing why variance reduction matters.
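
A toy illustration of the baseline effect, using a one-step Bernoulli bandit where the log-derivative gradient sample with respect to the logit is (a - p) * (G - b); the payoffs and sample count are invented:

```python
import random
import statistics

random.seed(0)
p = 0.5  # pi(a=1) = p

def grad_samples(baseline, n=5000):
    """REINFORCE gradient samples (a - p) * (G - baseline): unbiased for any
    constant baseline, but the baseline changes the variance."""
    samples = []
    for _ in range(n):
        a = 1 if random.random() < p else 0
        G = 1.0 if a == 1 else 0.0          # action 1 pays 1, action 0 pays 0
        samples.append((a - p) * (G - baseline))
    return samples

no_base = grad_samples(baseline=0.0)
with_base = grad_samples(baseline=0.5)      # baseline = expected return

# Same gradient estimate on average, far less noise with the baseline:
print(statistics.variance(with_base) < statistics.variance(no_base))  # True
```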

07

Actor-Critic Methods

Combine value estimation and policy learning into a more practical training loop.

You will learn

  • How the critic provides lower-variance learning signals
  • Why advantage estimates bridge value and policy learning
  • How n-step returns improve learning speed

Hands-on practice

Implement a simple A2C loop on a low-dimensional environment.

Expected output

A working actor-critic baseline with policy and value loss tracking.
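
The core arithmetic of one A2C update, sketched with plain scalars (the rewards, values, and log-probs are made up; a real implementation would also detach advantages from the critic when forming the policy loss):

```python
def a2c_losses(rewards, values, bootstrap, log_probs, gamma=0.99):
    """One A2C update in scalars: n-step returns, advantages, then
    policy loss -log pi * advantage and a mean-squared value loss."""
    returns, g = [], bootstrap
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    advantages = [ret - v for ret, v in zip(returns, values)]
    policy_loss = -sum(lp * a for lp, a in zip(log_probs, advantages))
    value_loss = sum(a * a for a in advantages) / len(advantages)
    return policy_loss, value_loss, advantages

pl, vl, adv = a2c_losses(
    rewards=[1.0, 1.0], values=[1.8, 1.9], bootstrap=1.0,
    log_probs=[-0.7, -0.7],
)
print([round(a, 3) for a in adv])  # [1.17, 0.09]: critic under-estimates both
```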

08

Proximal Policy Optimization

Learn the standard practical policy-gradient algorithm used in many modern baselines.

You will learn

  • Why clipped objectives stabilize updates
  • How GAE balances bias and variance
  • How rollout length, mini-batches, and entropy bonuses interact

Hands-on practice

Build PPO in PyTorch and train on a continuous-control task.

Expected output

A PPO trainer with separate actor/critic losses and rollout diagnostics.
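
The clipped objective and GAE both fit in a few lines of plain Python; this sketch uses invented ratios and advantages to show why large policy ratios stop paying off:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A). Ratios past
    the clip range stop improving the objective, so updates stay small."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def gae(rewards, values, bootstrap, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: exponentially weighted sum of
    one-step TD errors; lam trades bias (low) against variance (high)."""
    advantages = [0.0] * len(rewards)
    next_value, running = bootstrap, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# A 50% larger ratio gains nothing beyond the 1.2 clip for positive advantage:
print(ppo_clip_objective(1.5, advantage=2.0))  # 2.4, same as ratio 1.2
print(ppo_clip_objective(1.2, advantage=2.0))  # 2.4
```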

09

RL Engineering with Tianshou

Move from algorithm demos to reproducible experiment pipelines.

You will learn

  • How Tianshou structures policies, collectors, buffers, and trainers
  • Why libraries matter once experiments become repetitive
  • How to preserve understanding while still using abstractions

Hands-on practice

Rebuild a previous scratch algorithm using Tianshou components.

Expected output

A clean experiment config that reruns the same policy reliably.

10

CartPole Pipeline Project

Turn a small RL experiment into a real engineering workflow.

You will learn

  • How to parameterize runs and compare seeds systematically
  • How to save checkpoints and evaluation policies
  • How to prepare a lightweight RL experiment for reuse

Hands-on practice

Package a CartPole Tianshou pipeline with config files and logging.

Expected output

A reusable RL project template with training and evaluation entrypoints.
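
One lightweight pattern for this, sketched with stdlib pieces only (RunConfig, its fields, and the stand-in train function are hypothetical):

```python
import json
import random
from dataclasses import dataclass, asdict

@dataclass
class RunConfig:
    """Hypothetical run config: everything that varies between experiments
    lives here, so any run can be reproduced from its config alone."""
    env_id: str = "CartPole-v1"
    seed: int = 0
    lr: float = 3e-4
    epochs: int = 10

def train(cfg):
    """Stand-in for a real training run; only the seeding discipline is real."""
    random.seed(cfg.seed)            # seed every source of randomness up front
    return 400.0 + 100.0 * random.random()

results, checkpoints = {}, {}
for seed in [0, 1, 2]:               # compare seeds systematically
    cfg = RunConfig(seed=seed)
    results[seed] = train(cfg)
    checkpoints[seed] = json.dumps(
        {"config": asdict(cfg), "final_reward": results[seed]}
    )

# Reproducibility check: same config, same outcome.
print(train(RunConfig(seed=0)) == results[0])  # True
```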

11

Capstone: PPO on MuJoCo

Apply everything to a harder continuous-control setting.

You will learn

  • How PPO behaves in continuous action spaces
  • How reward scale and observation normalization affect training
  • How to compare scratch and library implementations critically

Hands-on practice

Train a scratch PPO implementation on HalfCheetah-v4 and compare it with a Tianshou implementation.

Expected output

A capstone report with reward curves, failure modes, and engineering tradeoffs.
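
Observation normalization is one of the knobs that matters most here; a Welford-style running normalizer can be sketched in a few lines (the observation stream below is invented):

```python
class RunningNorm:
    """Welford running mean/variance for observation normalization.
    Continuous-control PPO is often sensitive to raw observation scale."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # numerically stable variance sum

    def normalize(self, x):
        var = self.m2 / self.n if self.n > 1 else 1.0
        return (x - self.mean) / (var ** 0.5 + 1e-8)

norm = RunningNorm()
for obs in [10.0, 12.0, 8.0, 11.0, 9.0]:     # raw observations around 10
    norm.update(obs)
print(round(norm.mean, 2), round(norm.normalize(10.0), 2))  # 10.0 0.0
```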

Common Pitfalls

⚠️

Treating reward as ground-truth quality

Reward is only as good as the environment signal. Agents can exploit reward design flaws while still looking successful.

⚠️

Debugging too late

A rollout collection bug can waste hours silently. Validate tiny runs and inspect transitions before committing to long training.

⚠️

Ignoring variance across seeds

One good run proves almost nothing. RL outcomes often swing dramatically with seed choice.

⚠️

Jumping to MuJoCo too early

If CartPole and Taxi are not deeply understood, PPO on continuous control will feel like magic and failure will be impossible to diagnose.

🏁 Capstone: PPO MuJoCo Agent

The capstone forces you to combine theory, implementation, logging discipline, and engineering judgment. If you can finish this project and explain why PPO worked or failed, you have moved from RL tourist to RL practitioner.