We address the challenge of policy evaluation in real-world applications of reinforcement learning, where the available historical data is limited by ethical, practical, or security considerations. Such a constrained distribution of data samples often leads to biased policy evaluation estimates. To remedy this, we propose performing policy comparison instead of policy evaluation, i.e., ranking the policies of interest by their value based on the available historical data. In addition, we present the Limited Data Estimator (LDE) as a simple method for evaluating and comparing policies from a small number of interactions with the environment. Our theoretical analysis shows that the LDE is statistically reliable on policy comparison tasks under mild assumptions on the distribution of the historical data. Our numerical experiments compare the LDE to other policy evaluation methods on the task of policy ranking and demonstrate its advantage in various settings.