Policy gradient not "learning"

data-mining reinforcement-learning pytorch implementation policy-gradients
2022-02-28 17:04:36

I am trying to implement the policy gradient example from Geron's "Hands-On Machine Learning" book, which can be found here. The notebook uses TensorFlow, and I am trying to do the same with PyTorch.

My model looks like this:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ELU(),
    nn.Linear(128, 2),
)

Criterion and optimizer:

criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)

Training:

import gym
import torch.nn.functional as F
from torch.distributions import Categorical

env = gym.make("CartPole-v0")

n_games_per_update = 10
n_max_steps = 1000
n_iterations = 250
save_iterations = 10
discount_rate = 0.95


for iteration in range(n_iterations): # Run the game 250 times
    all_rewards = []
    all_gradients = []
    n_steps = []
    optim.zero_grad()
    for game in range(n_games_per_update): # Run the game 10 times to accumulate gradients
        current_rewards = []
        current_gradients = []
        obs = env.reset()
        for step in range(n_max_steps): # Run a single game a maximum of 1000 steps

            # Forward pass: logits -> softmax probabilities -> sample an action
            logit = model(torch.tensor(obs, dtype=torch.float))
            output = F.softmax(logit, dim=0)
            c = Categorical(output)
            action = c.sample()

            # Backpropagate using the sampled action as a one-hot target
            y = torch.tensor([1.0 - action, action], dtype=torch.float)
            loss = criterion(logit, y)
            loss.backward()

            obs, reward, done, info = env.step(int(action))
            current_rewards.append(reward)
            current_gradients.append([p.grad for p in model.parameters()])
            if done:
                break
        n_steps.append(step)

        all_rewards.append(current_rewards)
        all_gradients.append(current_gradients)

    # Performs the discount and normalises
    all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)

    # For each batch of 10 games multiply the discounted rewards against the gradients of the 
    # network. Then take the mean for each layer
    new_gradients = []
    for var_index, _ in enumerate(model.parameters()):  # one entry per parameter tensor
        means = []
        for game_index, rewards in enumerate(all_rewards):
            for step, reward in enumerate(rewards):
                means.append(reward * all_gradients[game_index][step][var_index])
        new_gradients.append(torch.mean(torch.stack(means), 0, True).squeeze(0))

    # Apply the new gradients to the network
    for p, g in zip(model.parameters(), new_gradients):
        p.grad = g.clone()
    optim.step()

When I run the code for 250 iterations, I print out the average game length I get:

Iteration: 50, Average Length: 18.2
Iteration: 100, Average Length: 23.4
Iteration: 150, Average Length: 29.9
Iteration: 200, Average Length: 11.2
Iteration: 250, Average Length: 38.6

The network doesn't really improve, and training for longer doesn't help. My two questions are: 1. Am I doing something obviously wrong? 2. I noticed that the TensorFlow implementation uses the log of the probabilities, but I don't know how to incorporate that here.

1 Answer

I can't say for sure, but I think the problem here is that you are not subtracting the mean of the rewards.

The idea is that, after mean normalization, actions with above-average rewards get a positive value and actions with below-average rewards get a negative value.
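For what it's worth, here is a minimal sketch of a mean-centred helper with the same name and signature as the one used in the question (`discount_rewards` is a hypothetical inner helper; the book's actual implementation may differ in detail):

import numpy as np

def discount_rewards(rewards, discount_rate):
    # Walk backwards so each step's return includes the discounted future rewards
    discounted = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount_rate * running
        discounted[t] = running
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted = [discount_rewards(r, discount_rate) for r in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    # Subtracting the mean is the key step: above-average returns become positive
    # and below-average returns become negative
    return [(d - mean) / std for d in all_discounted]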

Your update step is -log(P(action)) * reward, which you then minimize with the optimizer.

Since P(action) < 1, we have log(P(action)) < 0 and therefore -log(P(action)) > 0.

If reward > 0, then -log(P(action)) * reward > 0. Minimizing this is the same as maximizing log(P(action)) * reward (which is negative), and that is maximized when P(action) = 1.

Conversely, if reward < 0, then -log(P(action)) * reward < 0. This has the opposite effect: P(action) gets driven towards 0.

The important part is that the different signs for above-average and below-average rewards cause the probability of actions associated with good rewards to increase, while the probability of actions associated with bad rewards decreases.
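To address your second question, about the log of the probabilities: `Categorical` already provides `log_prob`, so the usual way to write this in PyTorch is to build the REINFORCE loss from it directly rather than going through `BCEWithLogitsLoss`. A rough sketch for a single episode, reusing the names from your question (not a drop-in replacement for the book's gradient-averaging scheme):

# Play one episode and collect the log-probabilities of the sampled actions
log_probs = []
rewards = []

obs = env.reset()
for step in range(n_max_steps):
    logits = model(torch.tensor(obs, dtype=torch.float))
    dist = Categorical(logits=logits)
    action = dist.sample()
    log_probs.append(dist.log_prob(action))

    obs, reward, done, info = env.step(int(action))
    rewards.append(reward)
    if done:
        break

# Discounted, mean-normalized returns for this episode (your existing helper)
returns = torch.tensor(
    discount_and_normalize_rewards([rewards], discount_rate=discount_rate)[0],
    dtype=torch.float,
)

# REINFORCE loss: -log(P(action)) * return, summed over the episode.
# Minimizing it raises the probability of actions with above-average return
# and lowers it for actions with below-average return.
loss = -(torch.stack(log_probs) * returns).sum()

optim.zero_grad()
loss.backward()
optim.step()

Because each log-probability is weighted by a mean-normalized return, this is exactly the sign argument above: above-average actions get reinforced, below-average ones get suppressed.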