We consider the off-policy evaluation problem of reinforcement learning using deep neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high representation dimensionality. Specifically, we establish a sharp error bound for the fitted Q-evaluation that depends on the intrinsic low dimension, the smoothness of the state-action space, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' {\it mismatch in the function space}, which can be small even if the two policies are not close to each other in their tabular forms. Numerical experiments are provided to support our theoretical analysis.
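As a point of reference for the estimator analyzed above, the following is a minimal sketch of fitted Q-evaluation with a neural network Q-function, assuming a finite action space and an offline dataset of $(s, a, r, s')$ transitions collected by an unknown behavior policy. All names (e.g., \texttt{fitted\_q\_evaluation}, \texttt{target\_policy}) and architectural choices are illustrative assumptions, not the paper's implementation.

\begin{verbatim}
# Minimal sketch of fitted Q-evaluation (FQE) for off-policy evaluation.
# Assumes: finite action space, offline transitions (s, a, r, s'),
# and a target policy given as a map from states to action probabilities.
import copy
import torch
import torch.nn as nn

def fitted_q_evaluation(dataset, target_policy, init_states,
                        n_actions, state_dim, gamma=0.99,
                        n_iters=50, n_epochs=20, lr=1e-3):
    """dataset: dict of tensors with keys 's', 'a', 'r', 's_next'.
    target_policy(states) -> (batch, n_actions) action probabilities."""
    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))
    s, a, r, s_next = dataset['s'], dataset['a'], dataset['r'], dataset['s_next']

    for _ in range(n_iters):
        # Freeze the previous iterate Q_k to build the regression targets.
        q_prev = copy.deepcopy(q_net)
        with torch.no_grad():
            pi_next = target_policy(s_next)                 # (N, n_actions)
            v_next = (pi_next * q_prev(s_next)).sum(dim=1)  # E_{a'~pi} Q_k(s', a')
            y = r + gamma * v_next                          # Bellman target for Q_{k+1}

        # Least-squares regression of Q(s, a) onto the targets y.
        opt = torch.optim.Adam(q_net.parameters(), lr=lr)
        for _ in range(n_epochs):
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((q_sa - y) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

    # OPE estimate: expected Q under the target policy at initial states.
    with torch.no_grad():
        pi0 = target_policy(init_states)
        return (pi0 * q_net(init_states)).sum(dim=1).mean().item()
\end{verbatim}

The theoretical results concern this iterative regression scheme when the Q-function class is a deep network whose size is chosen to match the intrinsic dimension of the state-action space.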