I have trained an RL agent using the DQN algorithm. After 20,000 episodes my reward converged. Now, when I test this agent, it always takes the same action, regardless of the state. This seems strange to me. Can anyone help me with this? Is there any reason anyone can think of why the agent would behave like this?
Reward plot:
When I test the agent:
import numpy as np
import matplotlib.pyplot as plt

state = env.reset()
print('State: ', state)

# encode the state and query the trained Q-network
state_encod = np.reshape(state, [1, state_size])
q_values = model.predict(state_encod)
action_key = np.argmax(q_values)  # greedy action = index of the largest Q-value
print(action_key)
print(index_to_action_mapping[action_key])
print(q_values[0][0])
print(q_values[0][action_key])

# plot the Q-value of every action for this state
q_values_plotting = []
for i in range(0, action_size):
    q_values_plotting.append(q_values[0][i])
plt.plot(np.arange(0, action_size), q_values_plotting)
plt.show()
Every time it gives the same q_values plot, even though the initial state is different each time. Below is the q_values plot.
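To confirm that it is really the network (and not just the plotting) that ignores the input, here is a quick check I could run (only a sketch, using the same model, env, and state_size as above): feed several different reset states through the model and compare the rows of Q-values.

import numpy as np

# Sketch: predict Q-values for several different reset states and compare them.
# Assumes model, env, and state_size are defined as in the snippet above.
q_rows = []
for _ in range(5):
    s = env.reset()
    s_encod = np.reshape(s, [1, state_size])
    q_rows.append(model.predict(s_encod)[0])

q_rows = np.array(q_rows)
print('Q-values per state:\n', q_rows)
# If this is ~0, the network output does not depend on the state at all.
print('Max difference across states:', np.abs(q_rows - q_rows[0]).max())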
Testing:
Code:
from copy import deepcopy

# run 1000 greedy evaluation episodes with the trained Q-network
test_rewards = []
for episode in range(1000):
    terminal_state = False
    state = env.reset()
    episode_reward = 0
    while not terminal_state:
        print('State: ', state)
        state_encod = np.reshape(state, [1, state_size])
        q_values = model.predict(state_encod)
        action_key = np.argmax(q_values)
        action = index_to_action_mapping[action_key]
        print('Action: ', action)
        # custom environment: step takes (state, action) and returns 3 values
        next_state, reward, terminal_state = env.step(state, action)
        print('Next_state: ', next_state)
        print('Reward: ', reward)
        print('Terminal_state: ', terminal_state, '\n')
        print('----------------------------')
        episode_reward += reward
        state = deepcopy(next_state)
    print('Episode Reward: ' + str(episode_reward))
    test_rewards.append(episode_reward)
plt.plot(test_rewards)
plt.show()
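One more thing I want to check (again only a sketch, same assumptions as above) is how flat the Q-values are within a single state; if they are almost equal across actions, argmax will keep landing on the same index even when the outputs vary slightly with the state.

# Sketch: measure the spread of Q-values across actions for one state.
s = env.reset()
q = model.predict(np.reshape(s, [1, state_size]))[0]
print('Q-value spread (max - min):', q.max() - q.min())
print('Best action index:', np.argmax(q))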
Thanks.


