Q-learning seems to converge but doesn't always beat a random tic-tac-toe player

artificial-intelligence reinforcement-learning python q-learning game-ai combinatorial-games
2021-10-31 16:38:02

Q-learning is defined as:

Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') - Q(s, a)]

Here is my implementation of Q-learning for the tic-tac-toe problem:

import random
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np

today = datetime.today()
model_execution_start_time = str(today.year)+"-"+str(today.month)+"-"+str(today.day)+" "+str(today.hour)+":"+str(today.minute)+":"+str(today.second)

epsilon = .1
discount = .1
step_size = .1
number_episodes = 30000

def epsilon_greedy(epsilon, state, q_table):
    # With probability epsilon pick a random empty cell, otherwise pick the
    # empty cell with the highest Q-value for this state.

    def get_valid_index(state):
        # Indices of the empty ('-') cells on the board.
        i = 0
        valid_index = []
        for a in state:
            if a == '-':
                valid_index.append(i)
            i = i + 1
        return valid_index

    def get_arg_max_sub(values, indices):
        # Argmax of `values`, restricted to the given indices.
        return max(list(zip(np.array(values)[indices], indices)), key=lambda item: item[0])[1]

    if np.random.rand() < epsilon:
        return random.choice(get_valid_index(state))
    else:
        if state not in q_table:
            q_table[state] = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
        q_row = q_table[state]
        return get_arg_max_sub(q_row, get_valid_index(state))
    
def make_move(current_player, current_state , action):
    if current_player == 'X':
        return current_state[:action] + 'X' + current_state[action+1:]
    else : 
        return current_state[:action] + 'O' + current_state[action+1:]

q_table = {}
max_steps = 9

def get_other_player(p):
    if p == 'X':
        return 'O'
    else : 
        return 'X'
    
def win_by_diagonal(mark , board):
    return (board[0] == mark and board[4] == mark and board[8] == mark) or (board[2] == mark and board[4] == mark and board[6] == mark)
    
def win_by_vertical(mark , board):
    return (board[0] == mark and board[3] == mark and board[6] == mark) or (board[1] == mark and board[4] == mark and board[7] == mark) or (board[2] == mark and board[5] == mark and board[8]== mark)

def win_by_horizontal(mark , board):
    return (board[0] == mark and board[1] == mark and board[2] == mark) or (board[3] == mark and board[4] == mark and board[5] == mark) or (board[6] == mark and board[7] == mark and board[8] == mark)

def win(mark , board):
    return win_by_diagonal(mark, board) or win_by_vertical(mark, board) or win_by_horizontal(mark, board)

def draw(board):
    return win('X' , list(board)) == False and win('O' , list(board)) == False and (list(board).count('-') == 0)

s = []
rewards = []
def get_reward(state):
    reward = 0
    if win('X' ,list(state)):
        reward = 1
        rewards.append(reward)
    elif draw(state) :
        reward = -1
        rewards.append(reward)
    else :
        reward = 0
        rewards.append(reward)
        
    return reward

def get_done(state):
    return win('X' ,list(state)) or win('O' , list(state)) or draw(list(state)) or (state.count('-') == 0)
    
reward_per_episode = []
            
reward = []
def q_learning():
    for episode in range(0 , number_episodes) :
        t = 0
        state = '---------'

        player = 'X'
        random_player = 'O'


        if episode % 1000 == 0:
            print('in episode:',episode)

        done = False
        episode_reward = 0
            
        while t < max_steps:

            t = t + 1

            action = epsilon_greedy(epsilon , state , q_table)

            done = get_done(state)

            if done == True : 
                break

            if state not in q_table : 
                q_table[state] = np.array([0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])

            next_state = make_move(player , state , action)
            reward = get_reward(next_state)
            episode_reward = episode_reward + reward
            
            done = get_done(next_state)

            if done == True :
                q_table[state][action] = q_table[state][action] + (step_size * (reward - q_table[state][action]))
                break

            next_action = epsilon_greedy(epsilon , next_state , q_table)
            if next_state not in q_table : 
                q_table[next_state] = np.array([0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])

            q_table[state][action] = q_table[state][action] + (step_size * (reward + (discount * np.max(q_table[next_state]) - q_table[state][action])))

            state = next_state

            player = get_other_player(player)
            
        reward_per_episode.append(episode_reward)

q_learning()

The algorithm's player is assigned 'X', and the other player is assigned 'O':

    player = 'X'
    random_player = 'O'

Reward per episode:

plt.grid()
plt.plot([sum(i) for i in np.array_split(reward_per_episode, 15)])

which renders: (plot of the summed reward per chunk of 2,000 episodes, quickly rising to about 1,950 per chunk)

Playing the model against an opponent that makes random moves:

## Computer opponent that makes random moves against trained RL computer opponent
# Random takes move for player marking O position
# RL agent takes move for player marking X position

def draw(board):
    return win('X' , list(board)) == False and win('O' , list(board)) == False and (list(board).count('-') == 0)

x_win = []
o_win = []
draw_games = []
number_games = 50000

c = []
o = []

for ii in range (0 , number_games):
    
    if ii % 10000 == 0 and ii > 0:
        print('In game ',ii)
        print('The number of X game wins' , sum(x_win))
        print('The number of O game wins' , sum(o_win))
        print('The number of drawn games' , sum(draw_games))

    available_moves = [0,1,2,3,4,5,6,7,8]
    current_game_state = '---------'
    
    computer = 'X'
    random_player = 'O'

    number_moves = 0
    
    for i in range(0 , 5):

        randomer_move = random.choice(available_moves)
        number_moves = number_moves + 1
        current_game_state = current_game_state[:randomer_move] + random_player + current_game_state[randomer_move+1:]
        available_moves.remove(randomer_move)

        if number_moves == 9 : 
            draw_games.append(1)
            break
        if win('O' , list(current_game_state)) == True:
            o_win.append(1)
            break
        elif win('X' , list(current_game_state)) == True:
            x_win.append(1)
            break
        elif draw(current_game_state) == True:
            draw_games.append(1)
            break
            
        computer_move_pos = epsilon_greedy(-1, current_game_state, q_table)  # epsilon = -1 forces the greedy (learned) action
        number_moves = number_moves + 1
        current_game_state = current_game_state[:computer_move_pos] + computer + current_game_state[computer_move_pos+1:]
        available_moves.remove(computer_move_pos)
     
        if number_moves == 9 : 
            draw_games.append(1)
#             print(current_game_state)
            break
            
        if win('O' , list(current_game_state)) == True:
            o_win.append(1)
            break
        elif win('X' , list(current_game_state)) == True:
            x_win.append(1)
            break
        elif draw(current_game_state) == True:
            draw_games.append(1)
            break

Output:

In game  10000
The number of X game wins 4429
The number of O game wins 3006
The number of drawn games 2565
In game  20000
The number of X game wins 8862
The number of O game wins 5974
The number of drawn games 5164
In game  30000
The number of X game wins 13268
The number of O game wins 8984
The number of drawn games 7748
In game  40000
The number of X game wins 17681
The number of O game wins 12000
The number of drawn games 10319
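
In other words, after 40,000 evaluation games the trained X player has won 17,681 / 40,000 ≈ 44% of the games, lost 12,000 / 40,000 = 30%, and drawn the remaining 10,319 / 40,000 ≈ 26%.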

Does the reward-per-episode plot indicate that the algorithm has converged? If the model has converged, shouldn't the number of O game wins be zero?

1 Answer

The main problem I see is that, in the loop over time steps within every training episode, you select actions for both players (who should have opposing goals), but you update a single q_table (which can only ever be correct for the "perspective" of one of your two players) on both of those actions, and you update it for both of them using a single, shared reward function.

Intuitively, I guess this means your learning algorithm assumes that your opponent will always be helping you win, rather than assuming that your opponent plays optimally towards its own goals. You can see that this is indeed likely the case from your plot: you use 30,000 training episodes, split into 15 chunks of 2,000 episodes per chunk for the plot, and in the plot you very quickly reach a score of about 1,950 per chunk, which is almost the maximum possible! Now, I'm not 100% sure what the win rate of an optimal player against a random player would be, but I think it would likely have to be lower than 1,950 out of 2,000; random players will occasionally reach draws in tic-tac-toe, especially taking into account that your learning agent itself also isn't playing optimally (but ε-greedily)!
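
(To make that concrete: 30,000 episodes split into 15 chunks gives 2,000 episodes per chunk, so a chunk score of about 1,950 means X must be collecting the +1 win reward in at least roughly 1,950 / 2,000 ≈ 97.5% of its training episodes, since draws even count as -1 here.)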


You should instead pick one of the following solutions (there may be more; these are just the ones I can come up with on the spot):

  1. Keep track of two different tables of Q-values, one for each of the two players, and only update each of them on half of the actions (each of them treats the actions selected by the opposing player as just stochastic state transitions created by "the environment" or "the world"). See this answer for more on what such schemes look like.
  2. Only keep track of Q-values for your own agent (again updating it only on half of the actions as described above, specifically only on the actions actually selected by your own agent). The opposing player's actions should then NOT be selected based on the same Q-values, but by some different approach. For instance, you could select the opposing actions with a minimax or alpha-beta pruning search algorithm. Maybe selecting them to minimise rather than maximise the values in the same Q-table could also work (I haven't thought this idea fully through, not 100% sure). You could probably also just pick the opposing actions randomly, but then your agent will only learn to play well against a random opponent, not necessarily against strong opponents. A rough sketch of this second option is given below.
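
To illustrate that second option, here is a minimal sketch, not a drop-in fix: it keeps a single q_table for the learning agent only, updates it only on the agent's own moves, and treats the reply of a random opponent as part of the environment transition. It assumes the helper functions make_move, win, get_done, epsilon_greedy and the hyperparameters epsilon, discount, step_size from the question above; the reward scheme (+1 win, -1 loss, 0 otherwise) is illustrative rather than the exact one used in the question.

import random
import numpy as np

# Sketch only: single Q-table for the learning agent ('X'), updated only on the
# agent's own actions; the random opponent's reply is folded into the transition.
# Assumes make_move, win, get_done, epsilon_greedy, q_table, epsilon, discount
# and step_size from the question above.
def q_learning_vs_random(number_episodes=30000):
    agent, opponent = 'X', 'O'
    for episode in range(number_episodes):
        state = '---------'

        while not get_done(state):
            if state not in q_table:
                q_table[state] = np.zeros(9)

            # The agent picks and plays its own move (epsilon-greedily).
            action = epsilon_greedy(epsilon, state, q_table)
            after_agent = make_move(agent, state, action)

            # The opponent replies randomly; from the agent's point of view this
            # is just part of the environment's state transition.
            next_state = after_agent
            if not get_done(after_agent):
                empty_cells = [i for i, c in enumerate(after_agent) if c == '-']
                next_state = make_move(opponent, after_agent, random.choice(empty_cells))

            # Reward from the agent's perspective only (illustrative scheme:
            # +1 for a win, -1 for a loss, 0 for draws and non-terminal states).
            if win(agent, list(next_state)):
                reward = 1
            elif win(opponent, list(next_state)):
                reward = -1
            else:
                reward = 0

            if next_state not in q_table:
                q_table[next_state] = np.zeros(9)

            # Standard Q-learning update; bootstrap only from non-terminal states.
            target = reward
            if not get_done(next_state):
                target += discount * np.max(q_table[next_state])
            q_table[state][action] += step_size * (target - q_table[state][action])

            state = next_state

The same structure would also work for the first option, by keeping a second table for the opposing player and mirroring the update (with the reward sign flipped) on its moves.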

After taking care of one of the above suggestions, you will probably also want to make sure that your agent experiences games in which it starts as Player 1 as well as games in which it starts as Player 2, so that it trains for both of these possible scenarios and learns how to handle both of them. In your evaluation code (after training), I believe you always let the random opponent play first and the trained agent play second? If you do not cover this scenario in your training episodes, your agent may not learn how to handle it properly.
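
For example (a hedged sketch reusing make_move and random from above, with a hypothetical helper name), you could let the opponent open the game on every other training episode, so that the agent also sees positions in which it moves second:

# Hypothetical helper: on odd-numbered episodes the opponent 'O' makes a random
# opening move, so the agent trains both as the first and as the second player.
def opening_state(episode):
    state = '---------'
    if episode % 2 == 1:
        state = make_move('O', state, random.choice(range(9)))
    return state

Each training episode would then start from opening_state(episode) instead of always from '---------'.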


Finally, a few minor notes:

  • Your discount factor γ = 0.1 is an extremely small value. Common values in the literature are γ = 0.9, γ = 0.95, or even γ = 0.99. Tic-tac-toe episodes tend to be very short anyway, and we usually don't care much about winning quickly rather than slowly (a win is a win), so I would be inclined to use a high value like γ = 0.99.
  • A minor programming tip, not really AI-specific: your code contains various conditions of the form if <condition> == True :, for example if done == True :. The == True part is redundant, and these conditions can be written more simply as just if done: (as in the small example below).
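
For instance, the termination check in your training loop could read (same behaviour, just more idiomatic Python):

done = get_done(next_state)
if done:  # instead of: if done == True :
    q_table[state][action] = q_table[state][action] + (step_size * (reward - q_table[state][action]))
    break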