Artificial intelligence - Q-learning tic-tac-toe

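I have a tic-tac-toe game with a Q-learning algorithm, in which the AI plays against another instance of the same algorithm (the two do not share a Q matrix). After 200,000 games, however, I still beat the AI very easily, and it plays rather poorly. Moves are selected with an epsilon-greedy policy.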

What could be causing the AI not to learn?

[Edit]
Here is how I do it (pseudocode):

for(int i = 0; i < 200000; ++i){
    //Game is restarted here
    ticTacToe.play();
}

In my ticTacToe, I have a simple loop:

while(!isFinished()){
    swapPlaying(); //Change the players' turn
    Position toPlay = playing.whereToMove();

    applyPosition(toPlay);
    playing.update(toPlay);
}

//Here I just update my players depending on whether they won, drew or lost.

In my player, I choose the move with the following epsilon-greedy implementation:

Moves moves = getMoves(); // Return every move available
Qvalues qValues = getQValues(moves); // return only qvalues of interest
//also create the state and add it to the Q-matrix if not already in.

if(!optimal) {
     updateEpsilon(); //I update epsilon with the simple decay epsilon = 1/k, with k being the number of games played.
     double r = (double) rand() / RAND_MAX; // Random between 0 and 1
     if(r < epsilon) { //Exploration
         return randomMove(moves); // Selection of a random move among every move available.
     }
     else {
         return moveWithMaxQValue(qValues);
     }
} else { // If I'm not in the training part anymore
     return moveWithMaxQValue(qValues);
}
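For reference, a minimal self-contained sketch of this selection step. The names `epsilonForGame`, `argmaxIndex` and `chooseMove` are hypothetical and not taken from the code above, and `<random>` is swapped in for `rand()`. Note that with epsilon = 1/k, epsilon is already 0.01 by game 100 and 0.000005 by game 200,000, so almost all of the training games are played greedily.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: epsilon for game number k (k >= 1),
// following the epsilon = 1/k schedule described above.
double epsilonForGame(int k) {
    assert(k >= 1);
    return 1.0 / static_cast<double>(k);
}

// Index of the largest Q-value; the first one wins ties.
std::size_t argmaxIndex(const std::vector<double>& qValues) {
    assert(!qValues.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < qValues.size(); ++i) {
        if (qValues[i] > qValues[best]) best = i;
    }
    return best;
}

// Epsilon-greedy choice: qValues holds one entry per available move.
// Returns the index of the chosen move within that list.
std::size_t chooseMove(const std::vector<double>& qValues, double epsilon,
                       std::mt19937& rng) {
    assert(!qValues.empty());
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    if (unit(rng) < epsilon) {
        // Exploration: uniform pick among the available moves.
        std::uniform_int_distribution<std::size_t> pick(0, qValues.size() - 1);
        return pick(rng);
    }
    // Exploitation: move with the highest Q-value.
    return argmaxIndex(qValues);
}
```

Calling `chooseMove(q, epsilonForGame(k), rng)` during training and `chooseMove(q, 0.0, rng)` afterwards mirrors the `optimal` switch above.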

And I update with the following:

double reward = getReward(); // Return 1 if game won, -1 if game lost, 0 otherwise
double thisQ, maxQ, newQ;
Grid prevGrid = Grid(*grid); //I have a shared_ptr on the grid for simplicity
prevGrid.removeAt(position); // We remove the action executed before

string state = stateToString(prevGrid);
thisQ = qTable[state][action];
maxQ = maxQValues();

newQ = thisQ + alpha * (reward + gamma*maxQ - thisQ);
qTable[state][action] = newQ;
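A self-contained sketch of the same update rule, under stated assumptions: the `QTable` layout (board string mapped to nine Q-values), the helper `maxQ` and the function `updateQ` are hypothetical names, not the code above. In the standard tabular rule the maximum is taken over the state that follows the action, and a terminal state contributes 0. The snippet above does not show which state `maxQValues()` reads, so that part is an assumption here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical table layout: board string -> one Q-value per cell.
using QTable = std::unordered_map<std::string, std::array<double, 9>>;

// Largest Q-value among the legal moves of `state`.
// Returns 0 for a terminal state (no legal moves), so nothing is
// bootstrapped past the end of a game.
double maxQ(QTable& table, const std::string& state,
            const std::vector<int>& legalMoves) {
    if (legalMoves.empty()) return 0.0;
    // operator[] inserts a zero-filled row if the state is new.
    const auto& row = table[state];
    double best = row[legalMoves.front()];
    for (int m : legalMoves) best = std::max(best, row[m]);
    return best;
}

// One tabular Q-learning step:
// Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
void updateQ(QTable& table, const std::string& state, int action,
             double reward, const std::string& nextState,
             const std::vector<int>& nextLegalMoves,
             double alpha, double gamma) {
    assert(action >= 0 && action < 9);
    const double target = reward + gamma * maxQ(table, nextState, nextLegalMoves);
    double& q = table[state][action];
    q += alpha * (target - q);
}
```

Passing an empty `nextLegalMoves` marks the end of a game, in which case the target is just the reward.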

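As mentioned above, both AIs use the same algorithm, but they are two separate instances, so they do not have the same Q matrix. I read somewhere on Stack Overflow that I should take the opposing player's moves into account, but I update the state after both the player's move and the opponent's move, so I don't think that is necessary.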
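To make "updating after both the player's move and the opponent's move" concrete, here is one possible arrangement, offered only as a sketch. The class `DelayedLearner`, its method names, and the board encoding (a 9-character string with '-' for an empty cell) are all assumptions and not the code from this question. Each player holds back the update for its last move until it sees the board again, or until the game ends.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical learner that delays each update until the opponent has replied.
class DelayedLearner {
public:
    DelayedLearner(double alpha, double gamma) : alpha_(alpha), gamma_(gamma) {}

    // Remember the move just played; its update waits for the next observation.
    void recordMove(const std::string& state, int action) {
        assert(action >= 0 && action < 9);
        pendingState_ = state;
        pendingAction_ = action;
        hasPending_ = true;
    }

    // Our turn again: `state` already includes the opponent's reply.
    void observeTurn(const std::string& state) {
        if (hasPending_) applyTarget(gamma_ * bestLegal(state));
    }

    // Game over: reward is +1, -1 or 0; there is no next state to bootstrap from.
    void endGame(double reward) {
        if (hasPending_) applyTarget(reward);
    }

    double q(const std::string& state, int action) { return table_[state][action]; }

private:
    // Largest Q-value over the empty cells of `state` (0 if the board is full).
    double bestLegal(const std::string& state) {
        const auto& row = table_[state];
        bool found = false;
        double best = 0.0;
        for (std::size_t i = 0; i < row.size() && i < state.size(); ++i) {
            if (state[i] == '-' && (!found || row[i] > best)) {
                best = row[i];
                found = true;
            }
        }
        return best;
    }

    // Move the pending Q-value toward `target`, then clear the pending move.
    void applyTarget(double target) {
        double& value = table_[pendingState_][pendingAction_];
        value += alpha_ * (target - value);
        hasPending_ = false;
    }

    double alpha_;
    double gamma_;
    std::unordered_map<std::string, std::array<double, 9>> table_;
    std::string pendingState_;
    int pendingAction_ = 0;
    bool hasPending_ = false;
};
```

In this arrangement the game loop would call `observeTurn` before a player picks a move, `recordMove` right after it plays, and `endGame` on both players once the game finishes, so the loser's last move also receives its -1.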