Deep Q-learning based algorithms have been applied successfully to many decision-making problems, yet their theoretical foundations remain far less well understood. In this paper, we study Fitted Q-Iteration with a two-layer ReLU neural network parametrization and establish sample complexity guarantees for the algorithm. The approach estimates the Q-function in each iteration by solving a convex optimization problem. We show that this approach achieves a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^{2})$, which is order-optimal. The result holds for a countable state space and does not require structural assumptions on the MDP such as linearity or low-rank structure.
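For concreteness, below is a minimal sketch of the Fitted Q-Iteration loop with a two-layer ReLU Q-network in NumPy. All sizes and hyperparameters (`S_DIM`, `HIDDEN`, `GAMMA`, the learning rate, the iteration counts) are illustrative assumptions, and the per-iteration regression is fit here by plain gradient descent for brevity; the paper instead solves this regression step as an exact convex optimization problem, which this sketch does not reproduce.

```python
# Sketch of Fitted Q-Iteration (FQI) with a two-layer ReLU Q-network.
# All problem sizes and hyperparameters here are illustrative assumptions;
# the paper's method replaces the gradient-descent regression below with
# an exact convex program for the two-layer ReLU fit.
import numpy as np

rng = np.random.default_rng(0)

S_DIM, N_ACTIONS, HIDDEN = 4, 3, 64   # assumed problem dimensions
GAMMA = 0.99                          # assumed discount factor

def init_params():
    # Two-layer ReLU network: Q(s) = W2 @ relu(W1 @ s + b1) + b2,
    # with one output head per action.
    return {
        "W1": rng.normal(scale=0.1, size=(HIDDEN, S_DIM)),
        "b1": np.zeros(HIDDEN),
        "W2": rng.normal(scale=0.1, size=(N_ACTIONS, HIDDEN)),
        "b2": np.zeros(N_ACTIONS),
    }

def q_values(p, s):
    # s: (batch, S_DIM) -> Q-values of shape (batch, N_ACTIONS).
    h = np.maximum(p["W1"] @ s.T + p["b1"][:, None], 0.0)  # ReLU features
    return (p["W2"] @ h + p["b2"][:, None]).T

def fit_regression(p, s, a, y, lr=1e-2, epochs=200):
    # Least-squares fit of Q(s, a) to targets y via gradient descent.
    # (The paper solves this step as a convex optimization problem.)
    n = len(y)
    for _ in range(epochs):
        h_pre = p["W1"] @ s.T + p["b1"][:, None]
        h = np.maximum(h_pre, 0.0)
        q = (p["W2"] @ h + p["b2"][:, None]).T
        err = np.zeros_like(q)
        err[np.arange(n), a] = q[np.arange(n), a] - y  # error on taken actions
        g_W2 = err.T @ h.T / n
        g_b2 = err.mean(axis=0)
        g_h = (p["W2"].T @ err.T) * (h_pre > 0)        # backprop through ReLU
        g_W1 = g_h @ s / n
        g_b1 = g_h.mean(axis=1)
        for k, g in zip(("W1", "b1", "W2", "b2"), (g_W1, g_b1, g_W2, g_b2)):
            p[k] -= lr * g
    return p

def fitted_q_iteration(transitions, n_iters=20):
    # transitions: offline arrays (s, a, r, s_next) of sampled experience.
    s, a, r, s_next = transitions
    p = init_params()
    for _ in range(n_iters):
        # Bellman targets from the current Q estimate, then refit.
        y = r + GAMMA * q_values(p, s_next).max(axis=1)
        p = fit_regression(p, s, a, y)
    return p
```

Each outer iteration recomputes Bellman targets $y = r + \gamma \max_{a'} Q_k(s', a')$ from the fixed offline dataset and then refits the network to those targets; this two-step regression structure is what the sample complexity guarantee is stated for.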