Following the DeepMind publication, I recreated the environment and am trying to get a DQN to find and converge to the optimal policy. The agent's task is to learn how to sustainably collect apples (objects), where apple regrowth depends on their spatial configuration (the more apples nearby, the higher the regrowth rate). In short: the agent has to collect as many apples as possible (it receives a reward of +1 per collected apple) while still allowing them to regrow, thereby maximizing its return (if it depletes the resource too quickly, it loses out on future rewards). The grid game is shown in the figure below, where the player is the red square, its facing direction is grey, and the apples are green:
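To make the regrowth mechanic concrete, below is a simplified sketch of the kind of respawn rule I mean; the probabilities and the neighbourhood radius are illustrative placeholders, not the exact values from the paper or from my implementation:

# Illustrative sketch of the apple respawn rule (placeholder probabilities):
# an empty cell regrows an apple with a probability that increases with the
# number of apples currently in its neighbourhood.
import numpy as np

def respawn_probability(n_neighbouring_apples):
    # placeholder rates -- the real environment uses the values from the paper
    if n_neighbouring_apples == 0:
        return 0.0      # no apples nearby: the resource never comes back
    elif n_neighbouring_apples <= 2:
        return 0.01
    elif n_neighbouring_apples <= 4:
        return 0.05
    return 0.1

def maybe_respawn(apple_grid):
    # apple_grid: 2D boolean array, True where an apple is present
    new_grid = apple_grid.copy()
    for (y, x), present in np.ndenumerate(apple_grid):
        if not present:
            # count apples in the 3x3 neighbourhood (the real game uses a larger radius)
            neighbourhood = apple_grid[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            n = int(neighbourhood.sum())
            if np.random.rand() < respawn_probability(n):
                new_grid[y, x] = True
    return new_grid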
As described in the publication, I built a DQN to solve the game. However, no matter how I vary the learning rate, the loss, the exploration rate and its decay, the batch size, the optimizer, the replay buffer, or the size of the neural network, the DQN fails to find the optimal policy shown in the figure below:
I am wondering whether there is a bug in my DQN code (with a similar implementation I managed to solve the OpenAI Gym CartPole task). My code is pasted below:
import random
from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

# RLDebugger is my own logging mixin (provides self.record), defined elsewhere in the repo.


class DDQNAgent(RLDebugger):

    def __init__(self, observation_space, action_space):
        RLDebugger.__init__(self)
        # get size of state and action
        self.state_size = observation_space[0]
        self.action_size = action_space
        # hyper parameters
        self.learning_rate = .00025
        self.model = self.build_model()
        self.target_model = self.model
        self.gamma = 0.999
        self.epsilon_max = 1.
        self.epsilon = 1.
        self.t = 0
        self.epsilon_min = 0.1
        self.n_first_exploration_steps = 1500
        self.epsilon_decay_len = 1000000
        self.batch_size = 32
        self.train_start = 64
        # create replay memory using deque
        self.memory = deque(maxlen=1000000)
        self.target_model = self.build_model(trainable=False)

    # approximate the Q function with a neural network:
    # the state is the input, the Q value of each action is the output
    def build_model(self, trainable=True):
        model = Sequential()
        # This is a simple model with two small hidden layers; I thought it should be enough here,
        # and it is easy to swap in different architectures (stack layers, change activations).
        model.add(Dense(32, input_dim=self.state_size, activation='relu', trainable=trainable))
        model.add(Dense(32, activation='relu', trainable=trainable))
        model.add(Dense(self.action_size, activation='linear', trainable=trainable))
        model.compile(loss='mse', optimizer=RMSprop(lr=self.learning_rate))
        model.summary()
        # 1/ You can try different losses, e.g. logcosh, a twice-differentiable approximation of the Huber loss.
        # 2/ From a theoretical perspective the learning rate should decay with time to guarantee convergence.
        return model

    # get action from the model using an epsilon-greedy policy
    def get_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        q_value = self.model.predict(state)
        return np.argmax(q_value[0])

    # decay epsilon
    def update_epsilon(self):
        self.t += 1
        self.epsilon = self.epsilon_min + max(0., (self.epsilon_max - self.epsilon_min) *
            (self.epsilon_decay_len - max(0., self.t - self.n_first_exploration_steps)) / self.epsilon_decay_len)

    # store the transition and train the online network on a minibatch from the replay memory
    def train_model(self, action, state, next_state, reward, done):
        # save sample <s,a,r,s'> to the replay memory
        self.memory.append((state, action, reward, next_state, done))
        if len(self.memory) >= self.train_start:
            states, actions, rewards, next_states, dones = self.create_minibatch()
            targets = self.model.predict(states)
            target_values = self.target_model.predict(next_states)
            for i in range(self.batch_size):
                # Approx Q Learning
                if dones[i]:
                    targets[i][actions[i]] = rewards[i]
                else:
                    targets[i][actions[i]] = rewards[i] + self.gamma * (np.amax(target_values[i]))
            # and do the model fit!
            loss = self.model.fit(states, targets, verbose=0).history['loss'][0]
            for i in range(self.batch_size):
                self.record(actions[i], states[i], targets[i], target_values[i], loss / self.batch_size, rewards[i])

    def create_minibatch(self):
        # pick samples randomly from replay memory (using batch_size)
        batch_size = min(self.batch_size, len(self.memory))
        samples = random.sample(self.memory, batch_size)
        states = np.array([_[0][0] for _ in samples])
        actions = np.array([_[1] for _ in samples])
        rewards = np.array([_[2] for _ in samples])
        next_states = np.array([_[3][0] for _ in samples])
        dones = np.array([_[4] for _ in samples])
        return (states, actions, rewards, next_states, dones)

    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())
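Note that even though the class is called DDQNAgent, the target computed in train_model above is the standard DQN target (max of the target network's Q-values over actions). For comparison, a Double DQN target would select the greedy action with the online network and evaluate it with the target network; a minimal, self-contained sketch of that variant (online_model / target_model / gamma stand in for self.model / self.target_model / self.gamma):

# Sketch of a Double DQN target, for comparison with the standard DQN target used above.
import numpy as np

def double_dqn_targets(online_model, target_model, gamma,
                       states, actions, rewards, next_states, dones):
    targets = online_model.predict(states)
    # select the greedy next action with the online network ...
    best_actions = np.argmax(online_model.predict(next_states), axis=1)
    # ... but evaluate that action with the target network
    next_values = target_model.predict(next_states)
    for i in range(len(actions)):
        if dones[i]:
            targets[i][actions[i]] = rewards[i]
        else:
            targets[i][actions[i]] = rewards[i] + gamma * next_values[i][best_actions[i]]
    return targets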
Here is the code I use to train the model:
from dqn_agent import *
from environment import *

env = GameEnv()
observation_space = env.reset()
agent = DDQNAgent(observation_space.shape, 7)
state_size = observation_space.shape[0]

last_rewards = []
episode = 0
max_episode_len = 1000

while episode < 2100:
    episode += 1
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    #if episode % 100 == 0:
    #    env.render_env()
    total_reward = 0
    step = 0
    gameover = False
    while not gameover:
        step += 1
        #if episode % 100 == 0:
        #    env.render_env()
        action = agent.get_action(state)
        reward, next_state, done = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        total_reward += reward
        agent.train_model(action, state, next_state, reward, done)
        agent.update_epsilon()
        state = next_state
        terminal = (step >= max_episode_len)
        if done or terminal:
            last_rewards.append(total_reward)
            agent.update_target_model()
            gameover = True
    print('episode:', episode, 'cumulative reward: ', total_reward, 'epsilon:', agent.epsilon, 'step', step)
The target model is updated after every episode (one episode = 1000 steps).
Looking at the logs, the agent sometimes achieves very high scores several episodes in a row, but it never stabilises, and the per-episode results have extremely high variance (even late in the epsilon annealing and after running several thousand episodes). Looking at my code and the game, do you have any ideas that could help the algorithm stabilise the results / converge? I have been playing with the hyperparameters, but nothing has brought a really significant improvement.
Some parameters of the game and training:
Reward: +1 for every collected apple (green square).
Episode: 1000 steps; the game resets automatically after 1000 steps or if the player completely depletes the resource.
Target model update: after every game over.
The remaining hyperparameters can be found in the code above.
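For clarity on the exploration schedule: update_epsilon keeps epsilon at 1.0 for the first n_first_exploration_steps = 1500 steps and then anneals it linearly down to epsilon_min = 0.1 over epsilon_decay_len = 1,000,000 steps, so with roughly 1000 steps per (full-length) episode it only reaches 0.1 around episode 1000. A small standalone check of the schedule (same formula as in the agent):

# Quick sanity check of the epsilon schedule used by the agent (same formula as update_epsilon)
epsilon_max, epsilon_min = 1.0, 0.1
n_first_exploration_steps, epsilon_decay_len = 1500, 1000000

def epsilon_at(t):
    return epsilon_min + max(0., (epsilon_max - epsilon_min) *
        (epsilon_decay_len - max(0., t - n_first_exploration_steps)) / epsilon_decay_len)

for t in [0, 1500, 251500, 501500, 1001500, 2000000]:
    print(t, round(epsilon_at(t), 3))
# prints 1.0 until step 1500, ~0.775 at 251500, ~0.55 at 501500, and 0.1 from step 1001500 onwards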
If you have any ideas, please let me know; I am happy to share the GitHub repository. Feel free to email me at macwiatrak@gmail.com
PS: I am aware this is similar to the question linked below, but I have already tried the suggestions there without success, which is why I decided to ask a separate question. DQN cannot learn or converge
EDIT: Added a plot of the rewards (below).