🏃 RL / Continuous Control

PPO on MuJoCo HalfCheetah

This project is where policy optimization stops being a slogan and becomes engineering. You will implement PPO with GAE, minibatch updates, and rollout management.

Project Background

If DQN represents the pixel-control era of value learning, PPO represents the practical rise of policy optimization for continuous control. MuJoCo tasks matter because they force you to deal with continuous action spaces, unstable updates, and the real cost of noisy gradients.

Problem it solves

The deeper problem is how to improve a policy in continuous action spaces without letting each update destroy previously useful behavior. PPO addresses this with clipped policy updates, an actor-critic structure, and GAE, which keep variance under control while still making progress.

What you learn

  • ▸ Actor-critic design
  • ▸ GAE and variance reduction
  • ▸ Continuous-action Gaussian policy
  • ▸ Rollout collection and policy update separation
The loss terms at the heart of the update look like this:

# Probability ratio between the new and old policy for the same actions
ratio = torch.exp(new_logp - old_logp)
clip_ratio = torch.clamp(ratio, 1 - eps, 1 + eps)
# Pessimistic (min) surrogate: clipped samples stop contributing gradient
policy_loss = -(torch.min(ratio * adv, clip_ratio * adv)).mean()
value_loss = ((value - returns) ** 2).mean()

Code walkthrough

The clipped objective defines PPO: it constrains policy updates so the new policy cannot drift too far from the old one in a single optimization phase. With eps = 0.2 and a positive advantage, for example, any ratio above 1.2 contributes the same clipped value, so that sample stops producing gradient and cannot push the policy further in that direction.

GAE is practical variance control: by blending multi-step TD errors through the lambda parameter, it is what makes PPO training much less noisy in practice.
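
For reference, the recursion in isolation looks like the sketch below. It assumes rewards, values, and dones are NumPy arrays, with values carrying one extra bootstrap entry at the end; the names are illustrative rather than tied to the script further down. Setting lam near 0 gives a low-variance but biased one-step estimate, while lam near 1 approaches a Monte Carlo return.

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    # A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values[:-1], dtype=np.float32)
    return advantages, returns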

Rollout storage is part of the algorithm: if trajectories, done flags, or bootstrap values are assembled incorrectly, the loss may still run while learning silently collapses.
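
One concrete case is the distinction between true termination and time-limit truncation when choosing the bootstrap value. A minimal sketch, assuming a model that returns (mu, std, value) like the ActorCritic in the script below:

import torch

def bootstrap_value(model, final_obs, terminated, device='cpu'):
    # A true termination has no future value; a time-limit truncation should still
    # bootstrap from the critic's estimate of the final state.
    if terminated:
        return 0.0
    with torch.no_grad():
        x = torch.tensor(final_obs, dtype=torch.float32, device=device).unsqueeze(0)
        return model(x)[2].item()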

The Gaussian head is where continuous control becomes real: unlike discrete policies, you must reason about means, log-stds, sampling noise, and entropy regularization.
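
A minimal sketch of those pieces, assuming mu and std come from the policy network; the entropy coefficient shown in the comment is a typical value, not a prescription:

import torch
from torch.distributions import Normal

def gaussian_policy_terms(mu, std, actions):
    dist = Normal(mu, std)
    logp = dist.log_prob(actions).sum(dim=-1)     # sum over action dimensions
    entropy = dist.entropy().sum(dim=-1).mean()   # higher entropy = more exploration
    return logp, entropy

# Example of folding the entropy bonus into the loss:
# loss = policy_loss + 0.5 * value_loss - 0.01 * entropy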

Full runnable code

A compact PPO implementation for continuous control environments like HalfCheetah. Save the code below as ppo_continuous_train.py, install the listed dependencies, and run it directly.

Dependencies

  • python>=3.10
  • torch
  • gymnasium[mujoco]
  • numpy

Run commands

pip install torch numpy "gymnasium[mujoco]"
python ppo_continuous_train.py

File tree

ppo-mujoco/
├── ppo_continuous_train.py
├── videos/
│   └── rollout.mp4
└── checkpoints/
    └── actor_critic.pt
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Normal


device = 'cuda' if torch.cuda.is_available() else 'cpu'
env = gym.make('HalfCheetah-v4')
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]


class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.Tanh(), nn.Linear(128, 128), nn.Tanh())
        self.mu = nn.Linear(128, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value = nn.Linear(128, 1)

    def forward(self, x):
        h = self.backbone(x)
        mu = self.mu(h)
        return mu, self.log_std.exp().expand_as(mu), self.value(h)


model = ActorCritic().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
clip_eps = 0.2
gamma = 0.99
lam = 0.95
update_epochs = 10


for episode in range(10):
    obs, _ = env.reset(seed=episode)
    traj = []
    done = False
    while not done:
        x = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
        with torch.no_grad():  # no gradients are needed while collecting the rollout
            mu, std, value = model(x)
            dist = Normal(mu, std)
            action = dist.sample()
            logp = dist.log_prob(action).sum(dim=-1)
        next_obs, reward, terminated, truncated, _ = env.step(action.squeeze(0).cpu().numpy())
        # Store terminated (not terminated or truncated) as the done flag so GAE can
        # still bootstrap through time-limit truncations.
        traj.append((obs, action.squeeze(0).cpu().numpy(), reward, value.item(), logp.item(), terminated))
        obs = next_obs
        done = terminated or truncated

    rewards = [t[2] for t in traj]
    # Bootstrap from V(s_T) when the episode ended by truncation rather than true termination.
    with torch.no_grad():
        last_value = 0.0 if terminated else model(
            torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0))[2].item()
    values = [t[3] for t in traj] + [last_value]
    advs, gae = [], 0.0
    for t in reversed(range(len(traj))):
        delta = rewards[t] + gamma * values[t + 1] * (1 - traj[t][5]) - values[t]
        gae = delta + gamma * lam * (1 - traj[t][5]) * gae
        advs.insert(0, gae)
    returns = [a + v for a, v in zip(advs, values[:-1])]

    obs_t = torch.tensor(np.array([t[0] for t in traj]), dtype=torch.float32, device=device)
    act_t = torch.tensor(np.array([t[1] for t in traj]), dtype=torch.float32, device=device)
    old_logp = torch.tensor([t[4] for t in traj], dtype=torch.float32, device=device)
    adv_t = torch.tensor(advs, dtype=torch.float32, device=device)
    ret_t = torch.tensor(returns, dtype=torch.float32, device=device)
    adv_t = (adv_t - adv_t.mean()) / (adv_t.std() + 1e-8)

    # Several PPO epochs over the same rollout; with more than one pass the ratio moves
    # away from 1 and clipping actually constrains the update.
    for _ in range(update_epochs):
        mu, std, values_pred = model(obs_t)
        dist = Normal(mu, std)
        new_logp = dist.log_prob(act_t).sum(dim=-1)
        ratio = torch.exp(new_logp - old_logp)
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        policy_loss = -(torch.min(ratio * adv_t, clipped * adv_t)).mean()
        value_loss = ((values_pred.squeeze(-1) - ret_t) ** 2).mean()
        loss = policy_loss + 0.5 * value_loss

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    print(f'episode={episode} return={sum(rewards):.1f} loss={loss.item():.4f}')

Build Steps

Rollout pipeline

Use vectorized environments, preallocated rollout storage, and a clean separation between data collection and policy updates.
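
A minimal sketch of that layout, assuming gymnasium's SyncVectorEnv and a random policy as a stand-in for the real actor; the buffer sizes and names are illustrative:

import gymnasium as gym
import numpy as np

num_envs, horizon = 8, 256
envs = gym.vector.SyncVectorEnv([lambda: gym.make('HalfCheetah-v4') for _ in range(num_envs)])
obs_dim = envs.single_observation_space.shape[0]
act_dim = envs.single_action_space.shape[0]

# Preallocated rollout storage: fixed shapes, no Python-list appends in the hot loop.
buf = {
    'obs':     np.zeros((horizon, num_envs, obs_dim), dtype=np.float32),
    'actions': np.zeros((horizon, num_envs, act_dim), dtype=np.float32),
    'rewards': np.zeros((horizon, num_envs), dtype=np.float32),
    'dones':   np.zeros((horizon, num_envs), dtype=np.float32),
}

obs, _ = envs.reset(seed=0)
for t in range(horizon):
    buf['obs'][t] = obs
    action = envs.action_space.sample()   # the real policy would sample here
    obs, reward, terminated, truncated, _ = envs.step(action)
    buf['actions'][t] = action
    buf['rewards'][t] = reward
    buf['dones'][t] = np.logical_or(terminated, truncated)
# ...the update phase runs only after the buffer is full...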

Actor-critic networks

Implement a Gaussian policy head and a value head with sensible initialization.
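
One common recipe, shown here as a sketch rather than the only valid choice, is orthogonal initialization with a small gain on the policy mean layer so that initial actions stay near zero:

import torch.nn as nn

def init_layer(layer, gain=1.0):
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

# Applied to the ActorCritic defined in the script above:
# for m in model.backbone:
#     if isinstance(m, nn.Linear):
#         init_layer(m, gain=nn.init.calculate_gain('tanh'))
# init_layer(model.mu, gain=0.01)     # small gain keeps early actions close to zero
# init_layer(model.value, gain=1.0)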

GAE and PPO loss

Compute advantages carefully, normalize them, and implement the clipped policy and value objectives.
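
The script above keeps the value loss as a plain squared error; a clipped value objective is a common optional variant. A sketch, assuming old_values were recorded at rollout time alongside the other rollout tensors:

import torch

def clipped_value_loss(values_pred, old_values, returns, clip_eps=0.2):
    # Limit how far the new prediction may move from the rollout-time prediction,
    # then take the worse (larger) of the two errors.
    values_clipped = old_values + torch.clamp(values_pred - old_values, -clip_eps, clip_eps)
    return torch.max((values_pred - returns) ** 2,
                     (values_clipped - returns) ** 2).mean()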

Minibatch optimization

Shuffle rollouts into minibatches and run multiple epochs without stretching the on-policy assumption too far.
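
A minimal sketch of that loop, assuming the flat rollout tensors obs_t, act_t, old_logp, adv_t, and ret_t built in the script above:

import torch

def minibatch_indices(n, batch_size, epochs):
    # Reshuffle once per epoch, then yield index slices covering the whole rollout.
    for _ in range(epochs):
        perm = torch.randperm(n)
        for start in range(0, n, batch_size):
            yield perm[start:start + batch_size]

# for idx in minibatch_indices(len(obs_t), batch_size=64, epochs=10):
#     mb_obs, mb_act, mb_old_logp = obs_t[idx], act_t[idx], old_logp[idx]
#     mb_adv, mb_ret = adv_t[idx], ret_t[idx]
#     # ...compute the clipped losses on the minibatch and step the optimizer...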

Common Pitfalls

  • ▸ Advantage normalization omitted
  • ▸ Ratio clipping implemented incorrectly
  • ▸ Mixing the rollout and update phases
  • ▸ Expecting the MuJoCo return to stabilize quickly

Success Criteria

  • ✅ PPO loss terms are implemented correctly
  • ✅ Rollout and update phases are clearly separated
  • ✅ The training curve improves over time
  • ✅ You can explain why clipping matters rather than just quoting the formula