We study the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a unified theoretical and empirical analysis of how different properties of the data distribution influence the performance of Q-learning-based algorithms. We connect different lines of research, as well as validate and extend previous results. We start by reviewing theoretical bounds on the performance of approximate dynamic programming algorithms. We then introduce a novel four-state MDP specifically tailored to highlight the impact of the data distribution on the performance of Q-learning-based algorithms with function approximation, both online and offline. Finally, we experimentally assess the impact of the data distribution properties on the performance of two offline Q-learning-based algorithms across different environments. According to our results: (i) high-entropy data distributions are well-suited for learning in an offline manner; and (ii) a certain degree of data diversity (data coverage) and data quality (closeness to the optimal policy) are jointly desirable for offline learning.