我想每个尝试用各种强化学习解决井字游戏的人都会遇到这个问题。
答案不是“总是赢”,因为随机对手有时可能会平局。所以它比总是赢的分数略低。
我写了一个小 Python 程序来计算它。请帮助验证其正确性并通知我是否有错误或错误。
我想每个尝试用各种强化学习解决井字游戏的人都会遇到这个问题。
答案不是“总是赢”,因为随机对手有时可能会平局。所以它比总是赢的分数略低。
我写了一个小 Python 程序来计算它。请帮助验证其正确性并通知我是否有错误或错误。
这篇博文建议,在与随机对手比赛时,如果代理人先出牌,胜率是 97.8%,如果他们出牌,胜率是 79.6%(其余为平局)。
这里我假设:
结果取决于评分方案:
(此方案用于 Gym Tic Tac Toe 的一个版本)
赢=20,平=10,输=-20:
最优期望值 =
X 先播放:19.94791666666666...
O先玩:19.164021164021...
赢=20,平=0,输=-20:
最优期望值 =
X 先播放:19.89583333333333...
O 首先播放:18.497354497354....
它还有助于使用包含在代码中的一些预先播放的棋盘位置来验证程序。
这是程序:
import math # for math.inf = infinity
print("Calculate optimal expectation value of TicTacToe")
print("from the perspective of 'X' = first player.")
print("Assume both players perfectly avoid illegal moves.")
print("Player 'X' always chooses the move with maximum expectation value.")
print("Player 'O' always plays all available moves with equal probability.")
print("You may modify the initial board position in the code.")
# Empty board
test_board = 9 * [0]
# Pre-moves, if any are desired:
# X|O|
# O|O|X
# X| |
#test_board[0] = -1
#test_board[3] = 1
#test_board[6] = -1
#test_board[4] = 1
#test_board[5] = -1
#test_board[1] = 1
def show_board(board):
for i in [0, 3, 6]:
for j in range(3):
x = board[i + j]
if x == -1:
c = '❌'
elif x == 1:
c = '⭕'
else:
c = ' '
print(c, end='')
print(end='\n')
if test_board != 9 * [0]:
print("\nInitial board position:")
show_board(test_board)
# **** Calculate expectation value of input board position
def expectation(board, player):
if player == -1:
# **** Find all possible next moves for player 'X'
moves = possible_moves(board)
# Calculate expectation of these moves;
# Player 'X' will only choose the one of maximum value.
max_v = - math.inf
for m in moves:
new_board = board.copy()
new_board[m] = -1 # Player 'X'
# If this an ending move?
r = game_over(new_board, -1)
if r is not None:
if r > max_v:
max_v = r
else:
v = expectation(new_board, 1)
if v > max_v:
max_v = v
# show_board(board)
print("X's turn. Expectation w.r.t. Player X =", max_v, end='\r')
return max_v
elif player == 1:
# **** Find all possible next moves for player 'O'
moves = possible_moves(board)
# These moves have equal probability
# print(board, moves)
p = 1.0 / len(moves)
# Calculate expectation of these moves;
# Player 'O' chooses one of them randomly.
Rx = 0.0 # reward from the perspective of 'X'
for m in moves:
new_board = board.copy()
new_board[m] = 1 # Player 'O'
# If this an ending move?
r = game_over(new_board, 1)
if r is not None:
if r == 10: # draw is +10 for either player
Rx += r * p
else:
Rx += - r * p # sign of reward is reversed
else:
v = expectation(new_board, -1)
Rx += v * p
# show_board(board)
print("O's turn. Expectation w.r.t. Player X =", Rx, end='\r')
return Rx
def possible_moves(board):
moves = []
for i in range(9):
if board[i] == 0:
moves.append(i)
return moves
# Check only for the given player.
# Return reward w.r.t. the specific player.
def game_over(board, player):
# check horizontal
for i in [0, 3, 6]: # for each row
if board[i + 0] == player and \
board[i + 1] == player and \
board[i + 2] == player:
return 20
# check vertical
for j in [0, 1, 2]: # for each column
if board[3 * 0 + j] == player and \
board[3 * 1 + j] == player and \
board[3 * 2 + j] == player:
return 20
# check diagonal
if board[0 + 0] == player and \
board[3 * 1 + 1] == player and \
board[3 * 2 + 2] == player:
return 20
# check backward diagonal
if board[3 * 0 + 2] == player and \
board[3 * 1 + 1] == player and \
board[3 * 2 + 0] == player:
return 20
# return None if game still open
for i in [0, 3, 6]:
for j in [0, 1, 2]:
if board[i + j] == 0:
return None
# For one version of gym TicTacToe, draw = 10 regardless of player;
# Another way is to assign draw = 0.
return 10
print("\u001b[2K\nOptimal value =", expectation(test_board, -1) )
示例输出(轮到 X 播放):