I am trying to implement the policy gradient example from Geron's "Hands-On Machine Learning" book, which can be found here. The notebook uses TensorFlow, and I am trying to do the same thing with PyTorch.
My model looks like this:
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(4, 128),
    nn.ELU(),
    nn.Linear(128, 2),
)
Criterion and optimizer:
criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.01)
Training:
import gym
import torch.nn.functional as F
from torch.distributions import Categorical

env = gym.make("CartPole-v0")
n_games_per_update = 10
n_max_steps = 1000
n_iterations = 250
save_iterations = 10
discount_rate = 0.95
for iteration in range(n_iterations):  # Run the outer loop 250 times
    all_rewards = []
    all_gradients = []
    n_steps = []
    optim.zero_grad()
    for game in range(n_games_per_update):  # Run the game 10 times to accumulate gradients
        current_rewards = []
        current_gradients = []
        obs = env.reset()
        for step in range(n_max_steps):  # Run a single game for a maximum of 1000 steps
            logit = model(torch.tensor(obs, dtype=torch.float))
            output = F.softmax(logit, dim=0)
            c = Categorical(output)
            action = c.sample()
            y = torch.tensor([1.0 - action, action], dtype=torch.float)
            loss = criterion(logit, y)
            loss.backward()
            obs, reward, done, info = env.step(int(action))
            current_rewards.append(reward)
            current_gradients.append([p.grad for p in model.parameters()])
            if done:
                break
        n_steps.append(step)
        all_rewards.append(current_rewards)
        all_gradients.append(current_gradients)
    # Perform the discount and normalise
    all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)
    # For each batch of 10 games, multiply the discounted rewards against the gradients of the
    # network, then take the mean for each parameter tensor
    new_gradients = []
    for var_index, gradient_placeholder in enumerate(gradient_placeholders):
        means = []
        for game_index, rewards in enumerate(all_rewards):
            for step, reward in enumerate(rewards):
                means.append(reward * all_gradients[game_index][step][var_index])
        new_gradients.append(torch.mean(torch.stack(means), 0, True).squeeze(0))
    # Apply the new gradients to the network
    for p, g in zip(model.parameters(), new_gradients):
        p.grad = g.clone()
    optim.step()
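For completeness, discount_and_normalize_rewards is not shown above; it is meant to do the same thing as the function in the book's notebook, i.e. discount each game's rewards and then normalize them using the mean and standard deviation across all games in the batch. Roughly (NumPy sketch, not necessarily my exact code):

import numpy as np

def discount_rewards(rewards, discount_rate):
    # Walk backwards through the rewards, accumulating the discounted return
    discounted = np.zeros(len(rewards))
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    # Discount each game's rewards, then normalize using the mean and std
    # computed over every step of every game in the batch
    all_discounted = [discount_rewards(r, discount_rate) for r in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    return [(d - mean) / std for d in all_discounted]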
When I run the code for 250 iterations, printing the average game length as it goes, I get:
Iteration: 50, Average Length: 18.2
Iteration: 100, Average Length: 23.4
Iteration: 150, Average Length: 29.9
Iteration: 200, Average Length: 11.2
Iteration: 250, Average Length: 38.6
The network does not really improve, and training for longer does not help. My two questions are:
1. Am I doing anything obviously wrong?
2. I noticed that the TensorFlow implementation uses the log of the probabilities, but I do not know how to integrate that here.
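For question 2, my current guess (just a sketch, not verified, reusing the model and environment from above) is that the per-step loss would be replaced by the negative log-probability of the sampled action, something like this:

import gym
import torch
from torch import nn
from torch.distributions import Categorical

model = nn.Sequential(nn.Linear(4, 128), nn.ELU(), nn.Linear(128, 2))
env = gym.make("CartPole-v0")
obs = env.reset()

# One step, using the log-probability of the sampled action instead of
# BCEWithLogitsLoss, which is what I understand REINFORCE normally does
logit = model(torch.tensor(obs, dtype=torch.float))
c = Categorical(logits=logit)     # Categorical accepts raw logits directly
action = c.sample()
step_loss = -c.log_prob(action)   # minimizing this raises the chosen action's probability
step_loss.backward()              # these gradients would then be weighted by the discounted rewards

But I am not sure this is the right way to combine the log-probability with the reward weighting that the book's notebook does.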