DQN cannot find the optimal policy

data-mining reinforcement-learning q-learning dqn convergence
2022-02-22 00:52:15

Based on a DeepMind publication, I recreated the environment and am trying to get a DQN to find and converge to the optimal policy. The agent's task is to learn how to sustainably collect apples (objects), whose regrowth depends on their spatial configuration (the more apples nearby, the higher the regrowth rate). In short: the agent has to collect as many apples as it can (it receives a reward of +1 for each apple collected) while still allowing them to regrow, thereby maximizing its return (if it depletes the resource too quickly, it loses out on future rewards). The grid game is shown in the figure below, where the player is the red square, its facing direction is grey, and the apples are green:

[figure: grid-world screenshot]

As described in the publication, I built a DQN to solve the game. However, no matter what learning rate, loss, exploration rate and decay schedule, batch size, optimizer, replay buffer, or larger network size I try, the DQN cannot find the optimal policy shown in the figure below:

[figure: optimal policy]

I am wondering whether there is a bug in my DQN code (with a similar implementation I managed to solve the OpenAI Gym CartPole task). My code is pasted below:

from collections import deque
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
# RLDebugger is a custom logging/debugging helper class defined elsewhere in the repository

class DDQNAgent(RLDebugger):
    def __init__(self, observation_space, action_space):
        RLDebugger.__init__(self)
        # get size of state and action
        self.state_size = observation_space[0]
        self.action_size = action_space
        # hyper parameters
        self.learning_rate = .00025
        self.model = self.build_model()
        self.target_model = self.model
        self.gamma = 0.999
        self.epsilon_max = 1.
        self.epsilon = 1.
        self.t = 0
        self.epsilon_min = 0.1
        self.n_first_exploration_steps = 1500
        self.epsilon_decay_len = 1000000
        self.batch_size = 32
        self.train_start = 64
        # create replay memory using deque
        self.memory = deque(maxlen=1000000)
        self.target_model = self.build_model(trainable=False)

    # approximate Q function using Neural Network
    # state is input and Q Value of each action is output of network
    def build_model(self, trainable=True):
        model = Sequential()
        # This is a simple model with two small hidden layers; it should be enough here,
        # and it is easy to experiment with different architectures (stack layers, change activations)
        model.add(Dense(32, input_dim=self.state_size, activation='relu', trainable=trainable))
        model.add(Dense(32, activation='relu', trainable=trainable))
        model.add(Dense(self.action_size, activation='linear', trainable=trainable))
        model.compile(loss='mse', optimizer=RMSprop(lr=self.learning_rate))
        model.summary()
        # 1/ You can try different losses, e.g. logcosh, which is a twice-differentiable approximation of the Huber loss
        # 2/ From a theoretical perspective, the learning rate should decay with time to guarantee convergence
        return model

    # get an action from the model using an epsilon-greedy policy
    def get_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        q_value = self.model.predict(state)
        return np.argmax(q_value[0])

    # decay epsilon linearly from epsilon_max to epsilon_min after the first n_first_exploration_steps
    def update_epsilon(self):
        self.t += 1
        self.epsilon = self.epsilon_min + max(
            0.,
            (self.epsilon_max - self.epsilon_min)
            * (self.epsilon_decay_len - max(0., self.t - self.n_first_exploration_steps))
            / self.epsilon_decay_len,
        )

    # store the transition in the replay memory and train the online model on a random minibatch
    def train_model(self, action, state, next_state, reward, done):

        # save sample <s,a,r,s'> to the replay memory
        self.memory.append((state, action, reward, next_state, done))

        if len(self.memory) >= self.train_start:
            states, actions, rewards, next_states, dones = self.create_minibatch()

            targets = self.model.predict(states)
            target_values = self.target_model.predict(next_states)

            for i in range(self.batch_size):
                # standard DQN target: r + gamma * max_a' Q_target(s', a')
                if dones[i]:
                    targets[i][actions[i]] = rewards[i]
                else:
                    targets[i][actions[i]] = rewards[i] + self.gamma * (np.amax(target_values[i]))

            # and do the model fit!
            loss = self.model.fit(states, targets, verbose=0).history['loss'][0]

            for i in range(self.batch_size):
                self.record(actions[i], states[i], targets[i], target_values[i], loss / self.batch_size, rewards[i])

    def create_minibatch(self):
        # pick samples randomly from replay memory (using batch_size)

        batch_size = min(self.batch_size, len(self.memory))
        samples = random.sample(self.memory, batch_size)

        states = np.array([_[0][0] for _ in samples])
        actions = np.array([_[1] for _ in samples])
        rewards = np.array([_[2] for _ in samples])
        next_states = np.array([_[3][0] for _ in samples])
        dones = np.array([_[4] for _ in samples])

        return (states, actions, rewards, next_states, dones)

    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())
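
As a side note on the two numbered comments inside build_model: below is a rough sketch of my own (using tf.keras 2.x names; the decay_steps/decay_rate values are purely illustrative) of how the same network could be compiled with a Huber loss and a decaying learning rate.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.losses import Huber
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.optimizers.schedules import ExponentialDecay

def build_model_alt(state_size, action_size, lr=2.5e-4):
    # Same architecture as build_model above, but compiled with:
    #  - a Huber loss instead of MSE (less sensitive to outlier TD errors;
    #    'logcosh' would be a smooth alternative),
    #  - a learning rate that decays over optimizer steps
    #    (the decay_steps / decay_rate values below are illustrative only).
    lr_schedule = ExponentialDecay(initial_learning_rate=lr,
                                   decay_steps=100_000,
                                   decay_rate=0.96)
    model = Sequential([
        Input(shape=(state_size,)),
        Dense(32, activation='relu'),
        Dense(32, activation='relu'),
        Dense(action_size, activation='linear'),
    ])
    model.compile(loss=Huber(), optimizer=RMSprop(learning_rate=lr_schedule))
    return model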

And here is the code I use to train the model:

from dqn_agent import *
from environment import *

env = GameEnv()
observation_space = env.reset()

agent = DDQNAgent(observation_space.shape, 7)

state_size = observation_space.shape[0]
last_rewards = []
episode = 0
max_episode_len = 1000
while episode < 2100:
    episode += 1
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    #if episode % 100 == 0:
     #   env.render_env()
    total_reward = 0

    step = 0
    gameover = False
    while not gameover:
        step += 1
        #if episode % 100 == 0:
         #   env.render_env()
        action = agent.get_action(state)
        reward, next_state, done = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        total_reward += reward
        agent.train_model(action, state, next_state, reward, done)
        agent.update_epsilon()
        state = next_state
        terminal = (step >= max_episode_len)
        if done or terminal:
            last_rewards.append(total_reward)
            agent.update_target_model()
            gameover = True

    print('episode:', episode, 'cumulative reward: ', total_reward, 'epsilon:', agent.epsilon, 'step', step)

The target model is updated after each episode (one episode = 1,000 steps).

Looking at the logs, the agent sometimes achieves very high results over many consecutive episodes, but it never stabilizes, and the per-episode returns have extremely high variance (even after increasing epsilon and running for several thousand episodes). Looking at my code and the game, do you have any ideas that could help the algorithm stabilize its results / converge? I have been playing with the hyperparameters, but nothing brings a really significant improvement.
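
To make the instability easier to see than in the raw per-episode log, here is a small monitoring sketch (not part of my training code) that smooths the returns stored in last_rewards:

import numpy as np

def rolling_stats(returns, window=100):
    # Rolling mean and standard deviation of the per-episode returns, e.g.
    # rolling_stats(last_rewards); a std that stays large relative to the mean
    # is the instability described above.
    returns = np.asarray(returns, dtype=float)
    means, stds = [], []
    for i in range(len(returns)):
        chunk = returns[max(0, i - window + 1):i + 1]
        means.append(chunk.mean())
        stds.append(chunk.std())
    return np.array(means), np.array(stds)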

Some game and training parameters:
Reward: +1 for each apple (green square) collected.
Episode: 1,000 steps; the game resets automatically after 1,000 steps or if the player completely depletes the resource.
Target model update: after each game ends.
Hyperparameters can be found in the code above.
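
For completeness, here is a rough sketch of the regrowth rule described at the top (the neighbourhood radius and the probability table are placeholder assumptions for illustration, not the actual values from the publication or my environment code):

import numpy as np

# Respawn probability keyed by the number of nearby apples, capped at 3 (illustrative values only)
RESPAWN_PROB = {0: 0.0, 1: 0.005, 2: 0.02, 3: 0.05}

def regrow(grid, radius=2, rng=None):
    # grid: 2D int array with 1 = apple, 0 = empty; returns the grid after one regrowth step
    rng = rng or np.random.default_rng()
    new_grid = grid.copy()
    h, w = grid.shape
    for y in range(h):
        for x in range(w):
            if grid[y, x] == 1:
                continue  # only empty cells can respawn an apple
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            n_nearby = int(grid[y0:y1, x0:x1].sum())
            if rng.random() < RESPAWN_PROB[min(n_nearby, 3)]:
                new_grid[y, x] = 1
    return new_grid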

If you have any ideas, please let me know; I am happy to share the GitHub repository. Feel free to email me at macwiatrak@gmail.com

PS I know this is similar to the question linked below, but I have already tried the suggestions there without success, so I decided to ask a separate question: DQN is not able to learn or converge

Edit: added a reward plot (below).

[figure: reward plot]

1 Answer

You say the agent can achieve high results, so it does seem to be learning something, but it cannot do so consistently. Is there a significant drop in performance at some point?

Do the results not improve if you use Adam instead of RMSProp? I would be surprised if this were a "hard" environment that requires a lot of tuning. I am also curious how your reward plot compares to "random" behaviour, which I think would also survive in this environment.
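
For concreteness, a rough sketch of those two checks (the optimizer swap is a one-line change in build_model; the baseline loop assumes the env.step(action) -> (reward, next_state, done) interface used in the question):

import random
import numpy as np

# 1) Optimizer swap: in DDQNAgent.build_model, compile with Adam instead of RMSprop, e.g.
#       from keras.optimizers import Adam
#       model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))

# 2) Random-policy baseline on the same environment, to compare reward curves against
def random_baseline(env, n_episodes=100, max_episode_len=1000, n_actions=7):
    returns = []
    for _ in range(n_episodes):
        env.reset()
        total, done, step = 0.0, False, 0
        while not done and step < max_episode_len:
            reward, _, done = env.step(random.randrange(n_actions))
            total += reward
            step += 1
        returns.append(total)
    return float(np.mean(returns)), float(np.std(returns))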