Artificial intelligence - The fitted Q-iteration algorithm and $Q^*(s, a)$: how do we use function approximation in this algorithm?

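I would like some clarification on fitted Q-iteration (FQI).

My research so far

I have read Sutton's book (specifically chapters 6 to 10), Ernst et al., and this paper.

I know that $Q^*(s, a)$ denotes the expected return of first taking action $a$ in state $s$ and then following the optimal policy forever.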
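In symbols, and assuming the usual discounted setting (the discount factor $\gamma \in [0, 1)$, the per-step reward $r_t$, and the optimal policy $\pi^*$ are notation introduced here, not quoted from the references), this definition is commonly written as:

$$Q^*(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi^*\right]$$

which in turn satisfies the Bellman optimality equation, where $r$ and $s'$ are the reward and next state observed after taking $a$ in $s$:

$$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$$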

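I have tried my best to understand function approximation in large state spaces and TD($n$).

My questions

Concept - Can someone explain the intuition behind how iteratively increasing N from 1 until the stopping conditions are reached leads to optimality (Section 3.5 of Ernst et al.)? I am having trouble seeing how this connects to the basic definition of $Q^*(s, a)$ that I stated above.
Implementation - Ernst et al. give pseudocode for the tabular form. But if I try to implement the function approximation form, is this correct: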

Repeat until stopping conditions are reached:
    - N ← N + 1
    - Build the training set TS based on the function Q^{N − 1} and on the full set of four-tuples F
    - Train the algorithm on the TS
    - Use the trained model to predict on the TS itself
    - Create TS for the next N by updating the labels - new reward plus (gamma * predicted values )
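For comparison, here is a minimal sketch of how the function-approximation loop is often written. It is an illustration under stated assumptions, not the implementation from Ernst et al.: the `NearestNeighbourRegressor` class, the `fitted_q_iteration` function, the 5-state chain environment, and `gamma = 0.5` are all hypothetical choices made to keep the example dependency-free. The one place it differs from the steps listed above is the label: the previous model is evaluated at the next state $s'$ and maximised over actions, rather than on the training inputs $(s, a)$ themselves.

```python
from typing import Callable, List, Sequence, Tuple

# One sample (s, a, r, s_next): a "four-tuple" from the set F.
FourTuple = Tuple[float, int, float, float]
QFunction = Callable[[Tuple[float, float]], float]


class NearestNeighbourRegressor:
    """Toy stand-in for any supervised regressor (hypothetical helper)."""

    def fit(self, inputs: Sequence[Tuple[float, float]],
            targets: Sequence[float]) -> "NearestNeighbourRegressor":
        self.inputs = list(inputs)
        self.targets = list(targets)
        return self

    def predict(self, x: Tuple[float, float]) -> float:
        # Return the label of the stored input closest to x.
        def sq_dist(p: Tuple[float, float]) -> float:
            return sum((pi - xi) ** 2 for pi, xi in zip(p, x))
        best = min(range(len(self.inputs)),
                   key=lambda i: sq_dist(self.inputs[i]))
        return self.targets[best]


def zero_q(_x: Tuple[float, float]) -> float:
    """Q_0: zero everywhere."""
    return 0.0


def fitted_q_iteration(F: List[FourTuple], actions: Sequence[int],
                       gamma: float, n_iterations: int) -> QFunction:
    # The regression inputs (s, a) never change; only the labels do.
    inputs = [(s, float(a)) for s, a, _, _ in F]
    q: QFunction = zero_q
    for _ in range(n_iterations):
        # Label = r + gamma * max over a' of Q_{N-1}(s', a'),
        # i.e. the previous model queried at the NEXT state.
        targets = [
            r + gamma * max(q((s_next, float(b))) for b in actions)
            for _, _, r, s_next in F
        ]
        q = NearestNeighbourRegressor().fit(inputs, targets).predict
    return q


def step(s: int, a: int) -> Tuple[float, float]:
    """Assumed toy chain: action 1 moves right, action 0 moves left.
    Reward is 1 whenever the next state is the right end (state 4)."""
    s_next = min(4, s + 1) if a == 1 else max(0, s - 1)
    return (1.0 if s_next == 4 else 0.0), float(s_next)


# Batch of four-tuples covering every (state, action) pair once.
F = [(float(s), a, *step(s, a)) for s in range(5) for a in (0, 1)]
q_hat = fitted_q_iteration(F, actions=(0, 1), gamma=0.5, n_iterations=30)
```

On the conceptual side, if the regressor reproduces its labels exactly (as it does on this toy chain), $Q_N$ is the optimal value over an $N$-step horizon, and with bounded rewards its distance to $Q^*$ shrinks by a factor of $\gamma$ per iteration. With a real approximator such as a tree ensemble or a neural network, each fit adds some regression error, so this is only a guide to the intuition.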

I have only just started learning RL as part of my course, so there are many gaps in my understanding. I would appreciate some kind guidance.