In this work, we study the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a theoretical and empirical analysis of why different properties of the data distribution can help regulate sources of algorithmic instability. First, we revisit theoretical bounds on the performance of approximate dynamic programming algorithms. Second, we provide a novel four-state MDP that highlights the impact of the data distribution on the performance of a Q-learning algorithm with function approximation, in both online and offline settings. Finally, we experimentally assess the impact of data distribution properties on the performance of an offline deep Q-network algorithm. Our results show that: (i) the data distribution must possess certain properties for robust learning in an offline setting, namely a low distance to the distributions induced by optimal policies of the MDP and high coverage over the state-action space; and (ii) high-entropy data distributions can help mitigate sources of algorithmic instability.
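As a rough illustration of the two distributional properties highlighted in (i) and (ii), the minimal Python sketch below computes the Shannon entropy and state-action coverage of an empirical dataset distribution from tabular visit counts. The function names and the example counts (for a hypothetical 4-state, 2-action MDP) are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def state_action_entropy(counts):
    """Shannon entropy (in nats) of an empirical state-action distribution.

    `counts` holds visit counts for each (s, a) pair in the dataset; higher
    entropy means the data is spread more evenly over the state-action space.
    """
    p = counts / counts.sum()
    p = p[p > 0]                      # unvisited pairs contribute 0 * log 0 = 0
    return -np.sum(p * np.log(p))

def coverage(counts):
    """Fraction of state-action pairs visited at least once."""
    return np.mean(counts > 0)

# Hypothetical example: 4 states x 2 actions = 8 state-action pairs.
# A near-uniform dataset has high entropy and full coverage; a dataset
# concentrated on a few pairs has low entropy and poor coverage.
uniform_counts = np.array([13, 12, 12, 13, 12, 13, 12, 13])
narrow_counts  = np.array([90,  8,  2,  0,  0,  0,  0,  0])

for name, c in [("uniform", uniform_counts), ("narrow", narrow_counts)]:
    print(f"{name}: entropy={state_action_entropy(c):.3f} nats, "
          f"coverage={coverage(c):.2f}")
```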