Vanilla Policy Gradient (VPG): How It Works and a PyTorch Implementation


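Have you ever wondered how reinforcement learning (RL) works?

In this article, we build one of the simplest forms of reinforcement learning from scratch: the vanilla policy gradient (VPG) algorithm. We then train it on the well-known CartPole challenge, learning to move a cart left and right to balance a pole. Along the way, we also complete the first challenge from OpenAI's Spinning Up learning resource.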


The code for this article can be found here.

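1. Our approach

We will tackle this by creating a simple deep learning model that takes an observation and outputs a stochastic policy (i.e. the probability of taking each possible action).

Then all we need to do is collect experience by acting in the environment according to this policy.

Once we have a large enough batch of experience (a collection of several episodes), we turn to gradient descent to improve the model. At a high level, we want to increase the expected return of the policy, which means adjusting the weights and biases to raise the probability of actions that lead to a high expected return. For VPG, this means using the policy gradient theorem, which gives an expression for the gradient of this expected return.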
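For reference, the expression in question can be written as follows. This is a sketch in the notation of the Spinning Up derivation, which is an assumption here since the article itself does not fix any symbols: \theta are the model parameters, \pi_\theta the policy, \tau a trajectory (episode), R(\tau) its total return, and J(\pi_\theta) the expected return.

```latex
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]
```

Each term pushes up the log probability of an action that was actually taken, weighted by the return of the episode it belongs to.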

That is all there is to it, so let's start coding!

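2. Creating the model

We start by creating a very simple model with a single hidden layer. The first linear layer takes its input features from CartPole's observation space, and the last layer returns one value (a logit) per possible action.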

# Imports used by all of the snippets in this article
import gym
import numpy as np
import torch
from torch import nn
from torch.distributions.categorical import Categorical
from torch.optim import Adam, Optimizer


def create_model(number_observation_features: int, number_actions: int) -> nn.Module:
    """Create the MLP model

    Args:
        number_observation_features (int): Number of features in the (flat)
            observation tensor
        number_actions (int): Number of actions

    Returns:
        nn.Module: Simple MLP model
    """
    hidden_layer_features = 32

    return nn.Sequential(
        nn.Linear(in_features=number_observation_features,
                  out_features=hidden_layer_features),
        nn.ReLU(),
        nn.Linear(in_features=hidden_layer_features,
                  out_features=number_actions),
    )
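As a quick sanity check of the shapes involved, here is a minimal sketch that builds the same architecture inline. The sizes 4 and 2 are assumptions (CartPole's observation has 4 features and there are 2 actions), and the all-zeros observation is only a placeholder.

```python
import torch
from torch import nn

# Assumed CartPole dimensions: 4 observation features, 2 discrete actions
model = nn.Sequential(
    nn.Linear(in_features=4, out_features=32),
    nn.ReLU(),
    nn.Linear(in_features=32, out_features=2),
)

# Placeholder observation, used only to check the output shape
observation = torch.zeros(4)
logits = model(observation)

# Weights and biases of both linear layers
parameter_count = sum(p.numel() for p in model.parameters())

print(logits.shape)       # one logit per action
print(parameter_count)    # (4 * 32 + 32) + (32 * 2 + 2)
```

The output is one unnormalized score per action; turning these scores into probabilities is the job of the next step.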


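3. Getting the policy

We also need to get the model's policy at every timestep (so that we know how to act). To do this, we create a get_policy function that uses the model to output the probability of each action under the policy. It returns a categorical (multinomial) distribution, which can then be used to pick a specific action at random according to those probabilities.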

def get_policy(model: nn.Module, observation: np.ndarray) -> Categorical:
    """Get the policy from the model, for a specific observation

    Args:
        model (nn.Module): MLP model
        observation (np.ndarray): Environment observation

    Returns:
        Categorical: Multinomial distribution parameterized by model logits
    """
    observation_tensor = torch.as_tensor(observation, dtype=torch.float32)
    logits = model(observation_tensor)

    # Categorical will also normalize the logits for us
    return Categorical(logits=logits)
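To make the "normalize the logits" comment concrete, the following small sketch compares the probabilities held by Categorical with an explicit softmax. The logit values are made up for illustration.

```python
import torch
from torch.distributions import Categorical

# Hypothetical raw model outputs for two actions
logits = torch.tensor([2.0, 0.0])

policy = Categorical(logits=logits)

# An explicit softmax over the same logits, for comparison
softmax_probs = torch.softmax(logits, dim=-1)

print(policy.probs)
print(softmax_probs)
```

Both tensors hold the same probabilities, and they sum to 1, which is why the model itself does not need a final softmax layer.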

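4. Sampling an action from the policy

At each timestep, we sample from this categorical distribution to get an action. We also record the log probability of that action, which will be needed later when computing the gradient.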

def get_action(policy: Categorical) -> tuple[int, torch.Tensor]:
    """Sample an action from the policy

    Args:
        policy (Categorical): Policy

    Returns:
        tuple[int, torch.Tensor]: Tuple of the action and its log probability
    """
    action = policy.sample()  # Unit tensor

    # Converts to an int, as this is what Gym environments require
    action_int = int(action.item())

    # Calculate the log probability of the action, which is required for
    # calculating the loss later. It stays a tensor so that gradients can
    # flow back to the model.
    log_probability_action = policy.log_prob(action)

    return action_int, log_probability_action
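The sketch below illustrates what sampling and log_prob do, using a hand-written distribution instead of a model. The probabilities 0.8 and 0.2 and the sample count are arbitrary choices for the illustration.

```python
import math

import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical policy that prefers action 0
policy = Categorical(probs=torch.tensor([0.8, 0.2]))

# Draw many actions and measure how often action 0 comes up
samples = policy.sample((10_000,))
fraction_action_0 = (samples == 0).float().mean().item()

# The log probability of action 0 is simply log(0.8)
log_prob_action_0 = policy.log_prob(torch.tensor(0)).item()

print(fraction_action_0)
print(log_prob_action_0, math.log(0.8))
```

Actions are drawn in proportion to their probabilities, so the policy keeps exploring the less likely action while mostly taking the preferred one.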

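5. Calculating the loss

The full derivation of the gradient is given in the Spinning Up documentation (the URL is in the docstring of calculate_loss below). Loosely speaking, it is the gradient of the sum, over every state-action pair, of the log probability of the action multiplied by the return of the whole trajectory that the pair belongs to. An additional outer sum aggregates several episodes (i.e. a batch), so that the estimate rests on a meaningful amount of data.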
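Written out, the estimate described in this paragraph looks roughly like the following. The symbols are assumed from the Spinning Up derivation rather than defined by this article: \mathcal{D} is the batch of collected episodes, \tau one episode, R(\tau) its total return, and \pi_\theta the policy with parameters \theta.

```latex
\hat{g}
  = \frac{1}{|\mathcal{D}|}
    \sum_{\tau \in \mathcal{D}}
    \sum_{t=0}^{T}
    \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)
```

The inner sum runs over the timesteps of one episode and the outer sum over the episodes in the batch.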

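To compute this with PyTorch, we calculate the pseudo-loss implemented in calculate_loss below and then call .backward() on it to obtain the gradient. The pseudo-loss is the same expression with the gradient operator removed, negated because PyTorch optimizers minimize.

This is often called a loss, but it is not a true loss, because its value does not measure how well the model performs. It is only useful for obtaining the policy gradient.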
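As a sketch of what calculate_loss computes, the pseudo-loss can be written as below. The notation is an assumption for this note: N is the total number of timesteps in the batch, (s_i, a_i) is the state-action pair at step i, and \tau_i is the episode that step belongs to.

```latex
L(\theta)
  = -\frac{1}{N} \sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i) \, R(\tau_i)
```

Because the code takes a mean over timesteps rather than dividing by the number of episodes, \nabla_\theta L(\theta) is the negated batch gradient estimate multiplied by a positive factor (the number of episodes divided by N). That rescaling changes the effective step size but not the direction of the update.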

def calculate_loss(epoch_log_probability_actions: torch.Tensor, epoch_action_rewards: torch.Tensor) -> torch.Tensor:
    """Calculate the 'loss' required to get the policy gradient

    Formula for gradient at
    https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient

    Note that this isn't really loss - it's just the mean of the log probability
    of each action times the episode return, negated. We calculate this so we
    can back-propagate to get the policy gradient.

    Args:
        epoch_log_probability_actions (torch.Tensor): Log probabilities of the
            actions taken
        epoch_action_rewards (torch.Tensor): Episode return credited to each of
            these actions

    Returns:
        torch.Tensor: Pseudo-loss (a scalar tensor, so .backward() can be called)
    """
    return -(epoch_log_probability_actions * epoch_action_rewards).mean()
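To check that back-propagating through this pseudo-loss really yields a policy gradient, here is a small sketch on a toy problem. Everything in it is hypothetical: the "policy" is just two logits that ignore the state, and the actions and returns are made-up numbers. For a softmax policy the gradient of log pi(a) with respect to the logits is onehot(a) - pi, which gives an analytic value to compare against.

```python
import torch
from torch.distributions import Categorical

# Hypothetical toy policy: two logits that ignore the state entirely
theta = torch.tensor([0.5, -0.5], requires_grad=True)

# Hypothetical batch: the action taken at each step, and the return of the
# episode that step belongs to
actions = torch.tensor([0, 1, 1, 0])
returns = torch.tensor([3.0, 3.0, 1.0, 1.0])

# Same computation as calculate_loss: negated mean of log-prob times return
log_probs = Categorical(logits=theta).log_prob(actions)
pseudo_loss = -(log_probs * returns).mean()
pseudo_loss.backward()

# Analytic gradient for a softmax policy:
# d log pi(a) / d theta = onehot(a) - pi
probs = torch.softmax(theta.detach(), dim=-1)
one_hot = torch.nn.functional.one_hot(actions, num_classes=2).float()
expected_grad = -((one_hot - probs) * returns.unsqueeze(1)).mean(dim=0)

print(theta.grad)
print(expected_grad)
```

The two printed gradients agree, so an optimizer step that minimizes the pseudo-loss moves the logits in the direction that makes the actions with higher returns more likely.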

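6. Training for one epoch

Putting all of the above together, we are ready to train for one epoch. To do this, we simply loop over episodes to build up a batch. Each episode produces a sequence of actions and rewards (i.e. experience) that is used to train the model.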

def train_one_epoch(
        env: gym.Env,
        model: nn.Module,
        optimizer: Optimizer,
        max_timesteps: int = 5000,
        episode_timesteps: int = 200) -> float:
    """Train the model for one epoch

    Args:
        env (gym.Env): Gym environment
        model (nn.Module): Model
        optimizer (Optimizer): Optimizer
        max_timesteps (int, optional): Max timesteps per epoch. Note if an
            episode is part-way through, it will still complete before
            finishing the epoch. Defaults to 5000.
        episode_timesteps (int, optional): Timesteps per episode. Defaults to 200.

    Returns:
        float: Average return from the epoch
    """
    epoch_total_timesteps = 0

    # Returns from each episode (to keep track of progress)
    epoch_returns: list[float] = []

    # Action log probabilities and rewards per step (for calculating loss)
    epoch_log_probability_actions = []
    epoch_action_rewards = []

    # Loop through episodes
    while True:

        # Stop if we've done over the total number of timesteps
        if epoch_total_timesteps > max_timesteps:
            break

        # Running total of this episode's rewards, and its length in steps
        episode_reward = 0.0
        episode_length = 0

        # Reset the environment and get a fresh observation. This uses the
        # classic Gym API, where reset() returns only the observation and
        # step() returns a 4-tuple.
        observation = env.reset()

        # Loop through timesteps until the episode is done (or the max is hit)
        for _ in range(episode_timesteps):
            epoch_total_timesteps += 1
            episode_length += 1

            # Get the policy and act
            policy = get_policy(model, observation)
            action, log_probability_action = get_action(policy)
            observation, reward, done, _ = env.step(action)

            # Increment the episode rewards
            episode_reward += reward

            # Add epoch action log probabilities
            epoch_log_probability_actions.append(log_probability_action)

            # Finish the action loop if this episode is done
            if done:
                break

        # Add one reward per timestep. Doing this after the loop keeps the
        # rewards aligned with the log probabilities even when the episode is
        # cut off by episode_timesteps rather than by `done`.
        epoch_action_rewards.extend([episode_reward] * episode_length)

        # Increment the epoch returns
        epoch_returns.append(episode_reward)

    # Calculate the policy gradient, and use it to step the weights & biases
    epoch_loss = calculate_loss(
        torch.stack(epoch_log_probability_actions),
        torch.as_tensor(epoch_action_rewards, dtype=torch.float32),
    )

    epoch_loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    return float(np.mean(epoch_returns))
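One detail of this function is easy to miss: every step in an episode is credited with that episode's full return, so the list of rewards has exactly one entry per recorded log probability. The pure-Python sketch below isolates that bookkeeping. The helper name episode_return_weights and the reward values are hypothetical and not part of the article's code.

```python
def episode_return_weights(episode_rewards: list[list[float]]) -> list[float]:
    """Give every step of an episode that episode's total return.

    Hypothetical helper for illustration only.
    """
    weights: list[float] = []
    for rewards in episode_rewards:
        episode_return = sum(rewards)
        # One copy of the episode return per step in the episode
        weights.extend([episode_return] * len(rewards))
    return weights


# Two made-up episodes: one lasting 3 steps, one lasting 2 steps,
# with a reward of 1 per step (as in CartPole)
batch = [[1.0, 1.0, 1.0], [1.0, 1.0]]
weights = episode_return_weights(batch)
print(weights)  # [3.0, 3.0, 3.0, 2.0, 2.0]
```

Steps from the longer episode get the larger weight, which is what steers the policy towards actions that kept the pole up for longer.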

7. Running the algorithm

We can now run the algorithm.

def train(epochs: int = 40) -> None:
    """Train a Vanilla Policy Gradient model on CartPole

    Args:
        epochs (int, optional): The number of epochs to run for. Defaults to 40.
    """
    # Create the Gym Environment
    env = gym.make('CartPole-v0')

    # Fix the random seeds (to make experiments reproducible). env.seed()
    # belongs to the classic Gym API.
    torch.manual_seed(0)
    env.seed(0)

    # Create the MLP model
    number_observation_features = env.observation_space.shape[0]
    number_actions = env.action_space.n
    model = create_model(number_observation_features, number_actions)

    # Create the optimizer
    optimizer = Adam(model.parameters(), 1e-2)

    # Loop for each epoch
    for epoch in range(epochs):
        average_return = train_one_epoch(env, model, optimizer)
        print('epoch: %3d \t return: %.3f' % (epoch, average_return))


if __name__ == '__main__':
    train()

After about 40 epochs, the model has learned the environment well (an average return of roughly 180 out of a maximum of 200):

epoch: 26 return: 118.070
epoch: 27 return: 114.659
epoch: 28 return: 135.405
epoch: 29 return: 144.000
epoch: 30 return: 143.972
epoch: 31 return: 152.091
epoch: 32 return: 166.065
epoch: 33 return: 162.613
epoch: 34 return: 166.806
epoch: 35 return: 172.933
epoch: 36 return: 173.241
epoch: 37 return: 181.071
epoch: 38 return: 186.222
epoch: 39 return: 176.793
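A note on reproducing these numbers: train() fixes the PyTorch seed, and the sketch below shows the effect this has on action sampling. The helper sample_actions is hypothetical and exists only for this illustration. Exact returns can still differ across library versions and hardware.

```python
import torch
from torch.distributions import Categorical


def sample_actions(seed: int, count: int = 20) -> list[int]:
    """Sample `count` actions from a fixed 50/50 policy after seeding PyTorch.

    Hypothetical helper for illustration only.
    """
    torch.manual_seed(seed)
    policy = Categorical(probs=torch.tensor([0.5, 0.5]))
    return policy.sample((count,)).tolist()


# The same seed gives the same sequence of sampled actions
first_run = sample_actions(seed=0)
second_run = sample_actions(seed=0)
print(first_run == second_run)  # True
```

Without a fixed seed, each run samples different actions, so the per-epoch returns would vary from run to run.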

Original article: Vanilla Policy Gradient Implementation - BimAnt