Reinforcement Learning
RL is where many smart engineers get lost because it mixes probability, optimization, function approximation, and systems issues all at once. This course breaks the field down into understandable steps, then builds it back up through scratch implementations and Tianshou workflows.
How beginners should use this course
- ▸ Start with tabular environments. They are not toys; they are microscopes.
- ▸ Plot everything: reward, episode length, value estimates, exploration rate, and losses.
- ▸ Keep your scratch implementations even after switching to Tianshou. They are your debugging reference.
- ▸ Do not trust any RL result you cannot reproduce across seeds and reruns.
Mathematical Foundations
Return and discounting
RL optimizes expected future return, not immediate reward alone.
Discounting captures the intuition that near-term outcomes usually matter more or are more certain.
Once this clicks, most value-function equations become easier to interpret.
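As a fixed point of reference, the discounted return the course keeps coming back to can be written as follows (standard textbook notation; the symbols are our choice, not taken from the course materials):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad 0 \le \gamma \le 1 .
```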
Bellman recursion
Bellman equations decompose a long-horizon problem into one-step reward plus future value.
That recursive structure unifies dynamic programming, TD, Q-learning, and PPO thinking.
RL feels fragmented until you see this shared backbone clearly.
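For reference, the Bellman expectation equation for a policy's state-value function, and the optimality form that Q-learning bootstraps toward, look like this (standard forms, included here only to anchor the discussion):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ R_{t+1} + \gamma \, V^{\pi}(S_{t+1}) \mid S_t = s \right]
\qquad
Q^{*}(s,a) = \mathbb{E}\!\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \mid S_t = s,\; A_t = a \right]
```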
Bias-variance tradeoffs everywhere
Monte Carlo, TD, GAE, replay buffers, and target networks all manage different bias-variance tradeoffs.
A lot of RL engineering comes down to accepting a tolerable amount of bias in exchange for variance low enough to learn from.
Once learners see that, the field stops looking random and starts looking structured.
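One compact way to see the tradeoff is to compare the targets that Monte Carlo and TD(0) push the value estimate toward (standard forms, shown only to make the intuition concrete):

```latex
\underbrace{G_t = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1}}_{\text{Monte Carlo target: unbiased, high variance}}
\qquad
\underbrace{R_{t+1} + \gamma \, V(S_{t+1})}_{\text{TD(0) target: biased by the current } V \text{, lower variance}}
```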
Detailed Modules
MDPs and Bellman Thinking
Learn the formal language of RL before touching algorithms.
You will learn
- ▸ What states, actions, rewards, transitions, and return actually mean
- ▸ Why Bellman equations sit at the center of RL reasoning
- ▸ How discounting changes optimization goals
Hands-on practice
Model a toy GridWorld as an MDP and write out returns for several trajectories.
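If you want to sanity-check the returns you compute by hand, a few lines of Python are enough; the reward sequence below is invented purely for illustration.

```python
# Compute discounted returns G_t for every step of one trajectory.
# The reward sequence here is made up for illustration, not a course-specified GridWorld.
def discounted_returns(rewards, gamma=0.9):
    returns = []
    g = 0.0
    for r in reversed(rewards):      # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

rewards = [0, 0, -1, 0, 10]          # hypothetical GridWorld episode
print(discounted_returns(rewards))   # prints G_0 ... G_4
```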
Expected output
A short MDP worksheet with Bellman updates done by hand.
Dynamic Programming
Solve small environments exactly so value iteration intuition becomes concrete.
You will learn
- ▸ Difference between policy evaluation and policy improvement
- ▸ How value iteration converges to an optimal policy
- ▸ Why tabular exact methods matter even if they do not scale
Hands-on practice
Implement policy iteration and value iteration on GridWorld.
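A minimal value-iteration sketch on a deterministic 4×4 GridWorld might look like the following; the layout, the -1 step reward, and the single terminal corner are assumptions made for illustration, not the course's exact environment.

```python
import numpy as np

# Value iteration on a hypothetical 4x4 GridWorld: -1 reward per step,
# single terminal state in the bottom-right corner, deterministic moves.
n, gamma, theta = 4, 1.0, 1e-6
V = np.zeros((n, n))
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

while True:
    delta = 0.0
    for i in range(n):
        for j in range(n):
            if (i, j) == (n - 1, n - 1):        # terminal state: value stays 0
                continue
            best = -np.inf
            for di, dj in actions:              # moves off the grid keep you in place
                ni, nj = min(max(i + di, 0), n - 1), min(max(j + dj, 0), n - 1)
                best = max(best, -1.0 + gamma * V[ni, nj])
            delta = max(delta, abs(best - V[i, j]))
            V[i, j] = best
    if delta < theta:
        break

print(np.round(V, 1))   # values count the negative steps-to-go to the terminal corner
```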
Expected output
A working tabular solver plus value heatmaps.
Monte Carlo Methods
Estimate value from full episodes without bootstrapping.
You will learn
- ▸ How first-visit and every-visit Monte Carlo differ
- ▸ Why Monte Carlo can be unbiased but high variance
- ▸ How importance sampling enters off-policy correction
Hands-on practice
Estimate state values from sampled episodes and compare variance across seeds.
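Here is a sketch of first-visit Monte Carlo evaluation on the classic 5-state random walk; the environment is hand-rolled so the snippet stays self-contained, and it is an illustration rather than the course's exact setup.

```python
import random
from collections import defaultdict

# First-visit Monte Carlo evaluation of a uniform random policy on a 5-state
# random walk: states 0..4, start in state 2, terminals at 0 and 4,
# reward 1 only for exiting on the right.
def run_episode():
    s, traj = 2, []
    while 0 < s < 4:
        s_next = s + random.choice([-1, 1])
        r = 1.0 if s_next == 4 else 0.0
        traj.append((s, r))
        s = s_next
    return traj

def first_visit_mc(n_episodes=5000, gamma=1.0):
    returns = defaultdict(list)
    for _ in range(n_episodes):
        traj = run_episode()
        g = 0.0
        for t in reversed(range(len(traj))):
            s, r = traj[t]
            g = r + gamma * g
            if all(s != s_prev for s_prev, _ in traj[:t]):   # keep only first visits
                returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(first_visit_mc())   # true values: V(1)=0.25, V(2)=0.5, V(3)=0.75
```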
Expected output
A notebook comparing Monte Carlo estimates against exact tabular solutions.
Temporal Difference Learning
Bridge Monte Carlo and dynamic programming with bootstrapped learning.
You will learn
- ▸ How TD(0), SARSA, and Q-learning differ
- ▸ Why off-policy Q-learning can learn optimal behavior from exploratory data
- ▸ How done masks and bootstrapping interact
Hands-on practice
Train TD variants on Taxi-v3 and compare convergence speed.
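A bare-bones tabular Q-learning loop on Taxi-v3 might look like this; the hyperparameters are reasonable guesses rather than tuned course values, and the handling of the done mask is the detail the bullets above point at.

```python
import numpy as np
import gymnasium as gym

# Tabular Q-learning on Taxi-v3 (hyperparameters are illustrative defaults).
env = gym.make("Taxi-v3")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Bootstrapped target; the mask zeroes out future value only at true terminal states.
        target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```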
Expected output
A benchmark plot showing reward curves for SARSA vs Q-learning.
Deep Q-Networks
Move from tables to neural approximators without losing algorithmic clarity.
You will learn
- ▸ Why replay buffers and target networks stabilize learning
- ▸ How ε-greedy exploration influences data quality
- ▸ Why Atari preprocessing is part of the algorithm
Hands-on practice
Build a DQN agent for Pong or CartPole with replay and target updates.
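The core of the stabilization story fits in a few lines: a frozen target network defines the regression target and the online network is trained toward it. The network sizes and CartPole dimensions below are assumptions for illustration, not the course's reference architecture.

```python
import torch
import torch.nn as nn

# Sketch of the DQN target and loss for one minibatch (CartPole-sized: 4 obs dims, 2 actions).
def make_net():
    return nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())   # target starts as a frozen copy

def dqn_loss(batch, gamma=0.99):
    obs, act, rew, next_obs, done = batch          # tensors sampled from the replay buffer
    q = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # no gradients flow through the target network
        q_next = target_net(next_obs).max(dim=1).values
        target = rew + gamma * (1.0 - done) * q_next
    return nn.functional.smooth_l1_loss(q, target)

# Tiny fake batch just to show the shapes; a real batch comes from the replay buffer.
obs = torch.randn(32, 4); next_obs = torch.randn(32, 4)
act = torch.randint(0, 2, (32,)); rew = torch.randn(32); done = torch.zeros(32)
print(dqn_loss((obs, act, rew, next_obs, done)))
# Every N gradient steps: target_net.load_state_dict(q_net.state_dict())
```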
Expected output
A scratch DQN implementation with reward and Q-value stability plots.
Policy Gradient and REINFORCE
Optimize policies directly and understand the variance pain that comes with it.
You will learn
- ▸ Where the log-derivative trick comes from
- ▸ Why policy gradients are unbiased but noisy
- ▸ How baselines reduce variance without bias
Hands-on practice
Train REINFORCE on CartPole and compare raw returns with baseline-corrected returns.
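A sketch of the REINFORCE loss for one episode, with an optional baseline subtracted before the gradient is taken; the variable names are illustrative, not taken from the course code.

```python
import torch

# REINFORCE loss for a single episode. `log_probs` are the log pi(a_t|s_t) values
# saved during the rollout and `rewards` the per-step rewards.
def reinforce_loss(log_probs, rewards, gamma=0.99, baseline=0.0):
    returns, g = [], 0.0
    for r in reversed(rewards):                 # returns-to-go, computed backwards
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Subtracting a baseline keeps the gradient unbiased while reducing its variance.
    advantages = returns - baseline
    return -(torch.stack(log_probs) * advantages).mean()

# Dummy usage: three steps of log-probs and rewards.
lp = [torch.tensor(-0.5, requires_grad=True) for _ in range(3)]
print(reinforce_loss(lp, [0.0, 0.0, 1.0]))
```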
Expected output
A small experiment showing why variance reduction matters.
Actor-Critic Methods
Combine value estimation and policy learning into a more practical training loop.
You will learn
- ▸ How the critic provides lower-variance learning signals
- ▸ Why advantage estimates bridge value and policy learning
- ▸ How n-step returns improve learning speed
Hands-on practice
Implement a simple A2C loop on a low-dimensional environment.
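A minimal A2C-style loss that combines the three terms the bullets describe; the loss coefficients below are common defaults, not values prescribed by the course.

```python
import torch
import torch.nn as nn

# One A2C-style loss from a short rollout. `log_probs`, `values`, `returns`, and
# `entropy` are assumed to be 1-D tensors collected during the rollout.
def a2c_loss(log_probs, values, returns, entropy, value_coef=0.5, entropy_coef=0.01):
    advantages = returns - values.detach()          # the critic provides a lower-variance signal
    policy_loss = -(log_probs * advantages).mean()
    value_loss = nn.functional.mse_loss(values, returns)
    # Entropy bonus discourages premature collapse to a deterministic policy.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```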
Expected output
A working actor-critic baseline with policy and value loss tracking.
Proximal Policy Optimization
Learn the standard practical policy-gradient algorithm used in many modern baselines.
You will learn
- ▸ Why clipped objectives stabilize updates
- ▸ How GAE balances bias and variance
- ▸ How rollout length, mini-batches, and entropy bonuses interact
Hands-on practice
Build PPO in PyTorch and train on a continuous-control task.
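The clipped surrogate objective itself is short; here is a sketch using the commonly quoted epsilon of 0.2 (an assumption, not necessarily the course's final setting).

```python
import torch

# PPO's clipped surrogate objective for one minibatch.
def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum makes the objective pessimistic, discouraging large policy jumps.
    return -torch.min(unclipped, clipped).mean()
```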
Expected output
A PPO trainer with separate actor/critic losses and rollout diagnostics.
RL Engineering with Tianshou
Move from algorithm demos to reproducible experiment pipelines.
You will learn
- ▸ How Tianshou structures policies, collectors, buffers, and trainers
- ▸ Why libraries matter once experiments become repetitive
- ▸ How to preserve understanding while still using abstractions
Hands-on practice
Rebuild a previous scratch algorithm using Tianshou components.
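For orientation, here is a condensed DQN quickstart in the spirit of Tianshou's pre-1.0 documentation; class and argument names have shifted between releases, so treat every signature below as an assumption and check the docs for your installed version.

```python
import gymnasium as gym
import torch
import tianshou as ts
from tianshou.utils.net.common import Net

# Environments: one reference env plus vectorized train/test envs.
env = gym.make("CartPole-v1")
train_envs = ts.env.DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])
test_envs = ts.env.DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])

# Policy: a small Q-network wrapped in Tianshou's DQN policy.
net = Net(state_shape=env.observation_space.shape, action_shape=env.action_space.n,
          hidden_sizes=[128, 128])
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = ts.policy.DQNPolicy(net, optim, discount_factor=0.99, target_update_freq=320)

# Collectors pair the policy with envs and a replay buffer.
train_collector = ts.data.Collector(policy, train_envs,
                                    ts.data.VectorReplayBuffer(20000, 8),
                                    exploration_noise=True)
test_collector = ts.data.Collector(policy, test_envs)

# The off-policy trainer runs the collect/update/evaluate loop.
result = ts.trainer.offpolicy_trainer(
    policy, train_collector, test_collector,
    max_epoch=10, step_per_epoch=10000, step_per_collect=10,
    update_per_step=0.1, episode_per_test=10, batch_size=64,
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),
    test_fn=lambda epoch, env_step: policy.set_eps(0.05))
```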
Expected output
A clean experiment config that reruns the same policy reliably.
CartPole Pipeline Project
Turn a small RL experiment into a real engineering workflow.
You will learn
- ▸ How to parameterize runs and compare seeds systematically
- ▸ How to save checkpoints and evaluation policies
- ▸ How to prepare a lightweight RL experiment for reuse
Hands-on practice
Package a CartPole Tianshou pipeline with config files and logging.
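A minimal sketch of run parameterization and seeding; the flag names and config fields are illustrative, not a prescribed schema.

```python
import argparse, json, random
import numpy as np
import torch

# Every setting lives in one config namespace, and the seed is applied to all RNGs
# before anything else happens, so runs can be compared across seeds systematically.
def parse_config():
    p = argparse.ArgumentParser()
    p.add_argument("--seed", type=int, default=0)
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--gamma", type=float, default=0.99)
    p.add_argument("--logdir", type=str, default="runs/cartpole")
    return p.parse_args()

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

if __name__ == "__main__":
    cfg = parse_config()
    set_seed(cfg.seed)
    print(json.dumps(vars(cfg), indent=2))   # log the exact config next to the checkpoints
```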
Expected output
A reusable RL project template with training and evaluation entrypoints.
Capstone: PPO on MuJoCo
Apply everything to a harder continuous-control setting.
You will learn
- ▸ How PPO behaves in continuous action spaces
- ▸ How reward scale and observation normalization affect training
- ▸ How to compare scratch and library implementations critically
Hands-on practice
Train a scratch PPO implementation on HalfCheetah-v4 and compare it against a Tianshou implementation.
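One piece of plumbing worth having ready before the capstone is running observation normalization; this standalone Welford-style version is an illustrative sketch, not the course's exact implementation.

```python
import numpy as np

# Running observation normalization: the kind of preprocessing that often makes or
# breaks PPO on MuJoCo tasks. Statistics are updated batch by batch during rollouts.
class RunningNorm:
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, x):
        batch_mean, batch_var, n = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean += delta * n / total
        self.var = (self.var * self.count + batch_var * n +
                    delta ** 2 * self.count * n / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

norm = RunningNorm(shape=(17,))   # 17 = HalfCheetah-v4 observation dimension
```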
Expected output
A capstone report with reward curves, failure modes, and engineering tradeoffs.
Common Pitfalls
Treating reward as ground-truth quality
Reward only measures what the environment designer encoded. Agents can exploit flaws in that design while still looking successful.
Debugging too late
A rollout collection bug can waste hours silently. Validate tiny runs and inspect transitions before committing to long training.
Ignoring variance across seeds
One good run proves almost nothing. RL outcomes often swing dramatically with seed choice.
Jumping to MuJoCo too early
If CartPole and Taxi are not deeply understood, PPO on continuous control will feel like magic and failure will be impossible to diagnose.
🏁 Capstone: PPO MuJoCo Agent
The capstone forces you to combine theory, implementation, logging discipline, and engineering judgment. If you can finish this project and explain why PPO worked or failed, you have moved from RL tourist to RL practitioner.