🧩 RLFramework

Tianshou CartPole Pipeline

This project teaches you how to stop rewriting reinforcement learning infrastructure from scratch. Instead, you learn how collectors, policies, replay buffers, and trainers fit together into a reusable experiment pipeline.

Project Background

After learning RL from scratch, the next bottleneck is no longer the Bellman equation; it is engineering repetition. Real teams eventually rely on libraries such as Tianshou to standardize collectors, buffers, and trainers so experiments become faster, more stable, and more reproducible.

Problem it solves

The real problem is not CartPole itself. The problem is how to stop rebuilding RL infrastructure every time you try a new experiment. This project solves that by teaching how framework abstractions map to the concepts you already know, so you gain speed without losing understanding.

What you learn

  • DQN pipeline with a framework
  • Collector and vectorized env usage
  • Config-driven experimentation
  • Reusable RL engineering patterns
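
For example, a DQN policy is configured in just a few lines. The snippet below assumes import tianshou as ts, plus a Q-network q_net and its optimizer optim defined elsewhere; the full runnable version appears later in this document:
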
policy = ts.policy.DQNPolicy(
    model=q_net,
    optim=optim,
    discount_factor=0.99,
    estimation_step=3,
    target_update_freq=320,
)

Code walkthrough

Abstraction is the main lesson: Tianshou hides boilerplate, but you still need to understand what flows through each abstraction, or you will not know how to debug when training fails.
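
To see what actually flows through those layers, inspect Tianshou's Batch, the container that collectors write and policies read. A minimal sketch, assuming only tianshou.data.Batch and numpy:

import numpy as np
from tianshou.data import Batch

# One transition exactly as the collector stores it in the buffer:
# observation, action, reward, done flag, and next observation.
transition = Batch(
    obs=np.array([0.01, -0.02, 0.03, 0.04], dtype=np.float32),
    act=np.array(1),
    rew=np.array(1.0),
    done=np.array(False),
    obs_next=np.array([0.02, 0.15, 0.02, -0.25], dtype=np.float32),
)

# Batch supports attribute access, so policy code reads transition.obs,
# transition.rew, etc., without caring which env produced them.
print(transition.obs, transition.rew)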

Frameworks change what you optimize for: instead of hand-writing loops, you spend more effort on experiment design, metrics, and reproducibility.

Policy, collector, and trainer own different responsibilities: the policy decides actions and learning logic, the collector interacts with environments, and the trainer coordinates the overall schedule. Keep the three distinct, or the framework turns into a black box.
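
To make those boundaries concrete, here is a rough hand-written sketch of one training cycle that offpolicy_trainer automates, reusing the policy and train_collector built in the full code below (the real trainer adds test phases, scheduling, and logging on top):

# Manual equivalent of one epoch, for intuition only.
policy.set_eps(0.1)                        # exploration while collecting
for step in range(200):
    train_collector.collect(n_step=10)     # the collector talks to the envs
    # the policy samples a batch from the buffer and takes a gradient step
    policy.update(64, train_collector.buffer)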

The goal is reusable experimentation: the real win is not shorter code, it is being able to swap seeds, environments, and configs without rewriting infrastructure.

Full runnable code

A minimal but complete Tianshou DQN training pipeline on CartPole-v1. Save the code below as tianshou_cartpole.py, install the listed dependencies, and run it directly.

Dependencies

  • python>=3.10
  • torch
  • tianshou
  • gymnasium
  • numpy

Run commands

pip install torch tianshou gymnasium numpy
python tianshou_cartpole.py

File tree

tianshou-cartpole/
├── tianshou_cartpole.py
├── configs/
│   └── dqn_cartpole.yaml
└── logs/
    └── train.log
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.policy import DQNPolicy
from tianshou.trainer import offpolicy_trainer


train_envs = DummyVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])
test_envs = DummyVectorEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])
# Read shapes from a single probe env: on a vectorized env these
# attributes come back as per-env lists, not a single space.
env = gym.make('CartPole-v1')
state_shape = env.observation_space.shape or env.observation_space.n
action_shape = env.action_space.n


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(state_shape[0], 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_shape),
        )

    def forward(self, obs, state=None, info=None):
        # The collector passes raw numpy observations; convert them first.
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float32)
        return self.model(obs), state


net = Net()
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = DQNPolicy(
    model=net,
    optim=optim,
    discount_factor=0.99,
    estimation_step=3,
    target_update_freq=320,
)

train_collector = Collector(policy, train_envs, VectorReplayBuffer(20000, len(train_envs)))
test_collector = Collector(policy, test_envs)
# Warm up the buffer with random-action transitions before training.
train_collector.collect(n_step=1024, random=True)

result = offpolicy_trainer(
    policy,
    train_collector,
    test_collector,
    max_epoch=5,
    step_per_epoch=2000,
    step_per_collect=10,
    episode_per_test=5,
    batch_size=64,
    update_per_step=0.1,
    # Without an exploration schedule eps stays at 0 and DQN rarely finds
    # rewarding actions; explore while training, act greedily in tests.
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),
    test_fn=lambda epoch, env_step: policy.set_eps(0.0),
)
print(result)

Build Steps

Model the workflow

Define config, seeds, and logging before writing training code so experiments stay reproducible, as in the sketch below.
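
A minimal sketch of that upfront setup, using a hard-coded config dict as a stand-in for configs/dqn_cartpole.yaml from the tree above:

import random
import numpy as np
import torch

# Hypothetical config values; in this project they would normally be
# loaded from configs/dqn_cartpole.yaml rather than hard-coded.
config = {
    'task': 'CartPole-v1',
    'seed': 0,
    'lr': 1e-3,
    'buffer_size': 20000,
    'num_train_envs': 4,
}

def set_seed(seed):
    # Seed every RNG the pipeline touches so reruns stay comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(config['seed'])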

Build policy and collectors

Implement a small Q-network, wrap it in DQNPolicy, and connect collectors to vectorized environments.

Use the trainer abstraction

Let Tianshou handle the experiment loop while you focus on metrics and failure analysis.
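
For example, experiment-level decisions can be expressed as trainer callbacks instead of loop code; this sketch reuses the trainer call from the full code and assumes CartPole-v1's conventional solve threshold of 475:

# Exploration, greedy testing, and the stopping criterion all live in
# small declarative callbacks rather than a hand-written loop.
result = offpolicy_trainer(
    policy,
    train_collector,
    test_collector,
    max_epoch=5,
    step_per_epoch=2000,
    step_per_collect=10,
    episode_per_test=5,
    batch_size=64,
    update_per_step=0.1,
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),   # explore
    test_fn=lambda epoch, env_step: policy.set_eps(0.0),    # greedy test
    stop_fn=lambda mean_rewards: mean_rewards >= 475,       # solved
)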

Generalize the pipeline

Make the same code easy to extend from CartPole to harder tasks.
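
One way to sketch that generalization: route the task name through a factory so switching environments is a one-line change (LunarLander-v2 is just an illustrative second task and needs gymnasium's box2d extra installed):

import gymnasium as gym
from tianshou.env import DummyVectorEnv

def make_envs(task, num_envs):
    # Downstream code only ever sees a vectorized env, so swapping tasks
    # never touches the collector, policy, or trainer wiring.
    return DummyVectorEnv([lambda: gym.make(task) for _ in range(num_envs)])

train_envs = make_envs('CartPole-v1', 4)
# Same call, harder task; the rest of the pipeline is unchanged:
# train_envs = make_envs('LunarLander-v2', 4)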

Common Pitfalls

  • Using the framework without understanding how transitions flow
  • Mishandling environment seeds
  • Confusing trainer metrics with true evaluation (see the sketch after this list)
  • Reusing configs without checking hidden defaults
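
On the evaluation pitfall in particular: trainer test scores are gathered mid-training, so a separate greedy rollout after training gives a more honest number. A minimal sketch, reusing policy and test_collector from the full code and assuming the older dict return format of Collector.collect:

# Final evaluation: greedy policy, fresh episodes, no training noise.
policy.eval()
policy.set_eps(0.0)                        # fully greedy action selection
eval_result = test_collector.collect(n_episode=20)
print('mean reward over 20 episodes:', eval_result['rews'].mean())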

Success Criteria

  • ✅ CartPole training is reproducible across seeds
  • ✅ You can explain what the collector, policy, and trainer each own
  • ✅ The pipeline is easy to extend to a second environment
  • ✅ You can map framework abstractions back to scratch RL concepts