Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation

In recommender systems (RecSys) and real-time bidding (RTB) for online advertisements, we often try to optimize sequential decision making using bandit and reinforcement learning (RL) techniques. In these applications, offline reinforcement learning (offline RL) and off-policy evaluation (OPE) are beneficial because they enable safe policy optimization using only logged data, without any risky online interaction. In this position paper, we explore the potential of using simulation to accelerate practical research on offline RL and OPE, particularly in RecSys and RTB. Specifically, we discuss how simulation can support empirical research on offline RL and OPE, and we argue that such research should make effective use of simulation. To refute the counterclaim that experiments using only real-world data are preferable, we first point out the underlying risks and reproducibility issues in real-world experiments. Then, we describe how these issues can be addressed by using simulations. Moreover, to defend our position, we show how to combine the benefits of real-world and simulation-based experiments. Finally, we present an open challenge concerning public simulation platforms, whose resolution would further facilitate practical research on offline RL and OPE in RecSys and RTB. As a possible solution, we present our ongoing open-source project and its potential use case. We believe that building and utilizing simulation-based evaluation platforms for offline RL and OPE will be of great interest and relevance to the RecSys and RTB community.
