人工智能 - 强化学习代理对抗随机对手的井字游戏的最佳分数是多少？ - 吾爱随笔录

强化学习代理对抗随机对手的井字游戏的最佳分数是多少？

人工智能强化学习深度学习深度学习井字游戏

2021-11-02 05:55:42

我想每个尝试用各种强化学习解决井字游戏的人都会遇到这个问题。

答案不是“总是赢”，因为随机对手有时可能会平局。所以它比总是赢的分数略低。

我写了一个小 Python 程序来计算它。请帮助验证其正确性并通知我是否有错误或错误。

2个回答

这篇博文建议，在与随机对手比赛时，如果代理人先出牌，胜率是 97.8%，如果他们出牌，胜率是 79.6%（其余为平局）。

这里我假设：

双方球员完美避免非法动作
玩家 X 总是选择期望值最大的走法
玩家 O 以相同的概率选择所有可用的移动

结果取决于评分方案：

（此方案用于 Gym Tic Tac Toe 的一个版本）

赢=20，平=10，输=-20：

最优期望值 =
- X 先播放：19.94791666666666...
- O先玩：19.164021164021...
赢=20，平=0，输=-20：

最优期望值 =
- X 先播放：19.89583333333333...
- O 首先播放：18.497354497354....

它还有助于使用包含在代码中的一些预先播放的棋盘位置来验证程序。

这是程序：

import math     # for math.inf = infinity

print("Calculate optimal expectation value of TicTacToe")
print("from the perspective of 'X' = first player.")
print("Assume both players perfectly avoid illegal moves.")
print("Player 'X' always chooses the move with maximum expectation value.")
print("Player 'O' always plays all available moves with equal probability.")
print("You may modify the initial board position in the code.")

# Empty board
test_board = 9 * [0]

# Pre-moves, if any are desired:
# X|O|
# O|O|X
# X| |
#test_board[0] = -1
#test_board[3] = 1
#test_board[6] = -1
#test_board[4] = 1
#test_board[5] = -1
#test_board[1] = 1

def show_board(board):
    for i in [0, 3, 6]:
        for j in range(3):
            x = board[i + j]
            if x == -1:
                c = '❌'
            elif x == 1:
                c = '⭕'
            else:
                c = '  '
            print(c, end='')
        print(end='\n')

if test_board != 9 * [0]:
    print("\nInitial board position:")
    show_board(test_board)

# **** Calculate expectation value of input board position
def expectation(board, player):

    if player == -1:
        # **** Find all possible next moves for player 'X'
        moves = possible_moves(board)

        # Calculate expectation of these moves;
        # Player 'X' will only choose the one of maximum value.
        max_v = - math.inf
        for m in moves:
            new_board = board.copy()
            new_board[m] = -1       # Player 'X'

            # If this an ending move?
            r = game_over(new_board, -1)
            if r is not None:
                if r > max_v:
                    max_v = r
            else:
                v = expectation(new_board, 1)
                if v > max_v:
                    max_v = v
        # show_board(board)
        print("X's turn.  Expectation w.r.t. Player X =", max_v, end='\r')
        return max_v

    elif player == 1:
        # **** Find all possible next moves for player 'O'
        moves = possible_moves(board)
        # These moves have equal probability
        # print(board, moves)
        p = 1.0 / len(moves)

        # Calculate expectation of these moves;
        # Player 'O' chooses one of them randomly.
        Rx = 0.0        # reward from the perspective of 'X'
        for m in moves:
            new_board = board.copy()
            new_board[m] = 1        # Player 'O'

            # If this an ending move?
            r = game_over(new_board, 1)
            if r is not None:
                if r == 10:             # draw is +10 for either player
                    Rx += r * p
                else:
                    Rx += - r * p       # sign of reward is reversed
            else:
                v = expectation(new_board, -1)
                Rx += v * p
        # show_board(board)
        print("O's turn.  Expectation w.r.t. Player X =", Rx, end='\r')
        return Rx

def possible_moves(board):
    moves = []
    for i in range(9):
        if board[i] == 0:
            moves.append(i)
    return moves

# Check only for the given player.
# Return reward w.r.t. the specific player.
def game_over(board, player):
    # check horizontal
    for i in [0, 3, 6]:     # for each row
        if board[i + 0] == player and \
           board[i + 1] == player and \
           board[i + 2] == player:
            return 20

    # check vertical
    for j in [0, 1, 2]:     # for each column
        if board[3 * 0 + j] == player and \
           board[3 * 1 + j] == player and \
           board[3 * 2 + j] == player:
            return 20

    # check diagonal
    if board[0 + 0] == player and \
       board[3 * 1 + 1] == player and \
       board[3 * 2 + 2] == player:
        return 20

    # check backward diagonal
    if board[3 * 0 + 2] == player and \
       board[3 * 1 + 1] == player and \
       board[3 * 2 + 0] == player:
        return 20

    # return None if game still open
    for i in [0, 3, 6]:
        for j in [0, 1, 2]:
            if board[i + j] == 0:
                return None

    # For one version of gym TicTacToe, draw = 10 regardless of player;
    # Another way is to assign draw = 0.
    return 10

print("\u001b[2K\nOptimal value =", expectation(test_board, -1) )

示例输出（轮到 X 播放）：

其它你可能感兴趣的问题

上一篇直接评估生成器分布是不可能的吗？下一篇理论上是否存在可以将 NLP 算法发送到无限循环或破坏聊天机器人的语言输入？