What is wrong with my tic-tac-toe Q-learning algorithm?

data-mining python q-learning
2022-03-02 18:11:26

My Q-learning algorithm currently picks the "second-best" option instead of the best one.

from pprint import pprint
from random import shuffle, choice, random
from operator import itemgetter


epsilon = .03  # Exploration
gamma = .9  # Learning rate
epochs = 100000  # Number of matches vs itself
states = {}

WON = 1
LOSE = 0
TIE = 0.5

COLS = 3
ROWS = 3
SIZE = COLS * ROWS

def buildBoard():
    return [''] * SIZE


def prettyPrint(board):
    if type(board) != list:
        board = eval(board)

    print('+-----+')
    for i in range(SIZE):
        print('|{}'.format(board[i] if board[i] != '' else ' '), end='')
        if i in (2, 5, 8):
            print('|')
    print('+-----+')


def isFinished(board):
    # +-----+
    # |0|1|2|
    # |3|4|5|
    # |6|7|8|
    # +-----+
    for player in ('X', 'O'):
        # Check rows
        for i in range(0, SIZE, 3):
            if board[i] == board[i + 1] == board[i + 2] == player:
                return {'winner': player, 'finished': True}

        # Check cols
        for i in range(ROWS):
            if board[i] == board[i + 3] == board[i + 6] == player:
                return {'winner': player, 'finished': True}

    # Check diagonals
    if (board[0] == board[4] == board[8] or 
        board[2] == board[4] == board[6]) and board[4] != '':
        return {'winner': board[4], 'finished': True}

    # Any empty cell left? Then the game is still running
    for i in range(SIZE):
        if board[i] == '':
            return {'winner': None, 'finished': False}

    # No empty cell and no winner: it's a draw
    return {'winner': None, 'finished': True}


def genStates(state, player='X'):
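    # Recursively enumerate every board reachable from this state with the
    # given player to move, storing an initial value for each one in states
    # (the terminal reward, or 0.5 for unfinished boards)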
    for i in range(SIZE):
        tmp = state[:]
        if tmp[i] == '':
            tmp[i] = player
            info = isFinished(tmp)

            if not info['finished']:
                reward = 0.5
                genStates(tmp, 'O' if player == 'X' else 'X')
            elif info["winner"] == 'X':
                reward = WON
            elif info["winner"] == 'O':
                reward = LOSE
            else:
                reward = TIE

            states[str(tmp)] = reward



def getAvalaibleStates(state, player):
    availables = []
    for i in range(SIZE):
        tmp = state[:]
        if tmp[i] == '':
            tmp[i] = player
            availables.append((tmp, states[str(tmp)]))

    shuffle(availables)
    return availables


def nextState(state, player, eps=epsilon):
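    # Epsilon-greedy selection: with probability eps return a random
    # successor, otherwise the successor with the best value for this player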
    availables = getAvalaibleStates(state, player)

    if eps > random():
        return availables[0][0]

    # The X player chooses the max value, O the min value
    if player == 'X':
        availables.sort(key=itemgetter(1), reverse=True)
    else:
        availables.sort(key=itemgetter(1))

    return availables[0][0]


def updateState(state, nextState):
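    # Adds the discounted value of the chosen successor onto the current
    # state's stored value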
    states[str(state)] = states[str(state)] + gamma * states[str(nextState)]


def play():
    while True:
        state = buildBoard()
        who = input("X/O: ")
        player = 'X'
        prettyPrint(state)
        while not isFinished(state)['finished']:  # Until finish
            if player == who:
                state[int(input("[0-8]: "))] = who
            else:
                next = nextState(state, player)
                state = next

            player = 'O' if player == 'X' else 'X'

            prettyPrint(state)
        print(isFinished(state))

def train(epochs=epochs):
    wins, loses, draw = 0, 0, 0
    for _ in range(epochs):
        state = buildBoard()
        player = 'X'
        while not isFinished(state)['finished']:  # Until finish
            next = nextState(state, player)
            updateState(state, next)
            state = next
            if player == 'X':
                player = 'O'
            else:
                player = 'X'

        if isFinished(state)['winner'] == 'X':
            wins += 1
        elif isFinished(state)['winner'] == 'O':
            loses += 1
        else:
            draw += 1

        if _ % 10000 == 0:
            print("Epoch n°", _)
            print("wins: {} - loses: {} - draw: {}\n".format(wins, loses, draw))
            if wins == 0 and _ > 0:
                break
            wins, loses, draw = 0, 0, 0


state = buildBoard()
states[str(state)] = TIE
genStates(state)
train()
play()

Is my value for epsilon or the learning rate too large? Or do I need a different formula for updating the Q states? Or am I sharing the states in a bad way?

Thanks

1 Answer

You have a few mistakes in how you assign rewards and in your update mechanism.

You intend to grant a reward of 0 for losing, 0.5 for a draw, and 1 for winning, and you place those rewards as fixed state values for finished boards that your logic never updates. That is fine (provided you do take care never to update those values).

In the genStates function you also assign an interim value of 0.5 to incomplete board states. That can also serve as an initial value (in principle any value would do). However, it points to a problem: you do not seem to distinguish between state values and rewards.
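
For illustration only, here is a minimal sketch of how a separate reward function could look, reusing your isFinished and the WON/LOSE/TIE constants (the function name reward is my own):

def reward(board):
    # The reward depends only on the outcome of a *finished* game;
    # every unfinished position gets 0, so it cannot compete with the
    # win/lose/draw signal
    info = isFinished(board)
    if not info['finished']:
        return 0
    if info['winner'] == 'X':
        return WON   # 1
    if info['winner'] == 'O':
        return LOSE  # 0
    return TIE       # 0.5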

You also have no learning rate, only a discount factor (which for some reason you call a learning rate; it is not). As it happens, in a game this simple both can be 1.

The update rule for Q-learning using afterstates should be:

$$V(s) \leftarrow V(s) + \alpha\left(R + \gamma \max_{s'}[V(s')] - V(s)\right)$$

If you set $\alpha = 1$ (which is effectively what you have), then this becomes:

$$V(s) \leftarrow R + \gamma \max_{s'}[V(s')]$$

Your code is implementing:

$$V(s) \leftarrow V(s) + \gamma V(s')$$

You need to separate out the value function and the reward calculation, so that you can use $R$ and not $V(s)$ in the update. Not only should this fix the problem, it will also make how the algorithm works much clearer. The simplest reward for an unfinished game is $0$. Other fixed values might work, but if they are high enough to interfere with the win/lose/draw rewards, then one of the players may prefer to lose early rather than win or draw a longer game. So stick with $0$.
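
As a rough sketch (not a drop-in fix), the separated update with $\alpha = 1$ could look something like this, using the reward helper sketched above and your existing getAvalaibleStates; the name updateStateQ is only for illustration:

def updateStateQ(state, player):
    # Tabular update with alpha = 1 and the reward kept separate:
    #   V(s) <- R + gamma * max over s' of V(s')   (min for the 'O' player)
    if isFinished(state)['finished']:
        return  # terminal values stay fixed at their reward

    successors = getAvalaibleStates(state, player)
    values = [value for _, value in successors]
    best = max(values) if player == 'X' else min(values)
    states[str(state)] = reward(state) + gamma * best

Since the reward for an unfinished position is 0, this amounts to $V(s) \leftarrow \gamma \max_{s'}[V(s')]$ for every state you actually update during play.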

In addition, when you make an exploratory move you do not update using the maximising action. That makes your algorithm SARSA, not Q-learning. At a low exploration rate this may not affect the final state values much, but they will not really represent optimal play, only something close to it. If you really want to implement Q-learning, you need to call updateState with the maximising (or minimising) action, whether or not the player actually took it in the game.
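
One possible way to arrange that, sketched under the assumption that the rest of your loop stays the same (step is a hypothetical replacement for the nextState/updateState pair, and the update call itself should of course be the corrected one above):

def step(state, player):
    successors = getAvalaibleStates(state, player)

    # Greedy successor: maximum value for 'X', minimum value for 'O'
    successors.sort(key=itemgetter(1), reverse=(player == 'X'))
    greedy = successors[0][0]

    # Q-learning: the update always uses the greedy successor,
    # even when the move actually played below is exploratory
    updateState(state, greedy)

    # Epsilon-greedy choice of the move that is actually played
    if random() < epsilon:
        return choice(successors)[0]
    return greedy

The training loop would then call state = step(state, player) instead of calling nextState and updateState separately.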

Note that in zero-sum games the rewards are usually -1 (loss), 0 (draw) and +1 (win). However, I don't think this makes any difference for you. The important trick in a two-player game is to switch between actions that minimise and maximise the value function depending on whose turn it is, and as far as I can see you are doing that correctly.

Expect it to take a few multiples of 10,000 games to learn the perfect value function.