Motivated by practical needs of experimentation and policy learning in online platforms, we study the problem of safe data collection. Specifically, our goal is to develop a logging policy that efficiently explores different actions to elicit information while achieving reward competitive with a baseline production policy. We first show that the common practice of mixing the production policy with randomized exploration, despite being safe, is sub-optimal in maximizing information gain. We then propose a safe optimal logging policy, based on a novel water-filling technique, for the case when no side information about the actions' expected rewards is available. We improve upon this design by incorporating side information, and further extend our approach to the linear contextual model to handle a large number of actions. Along the way, we analyze how our data-logging policies affect errors in off(line)-policy learning and empirically validate the benefits of our designs through extensive numerical experiments on synthetic and MNIST datasets. To further demonstrate the generality of our approach, we also consider the safe online learning setting. By adaptively applying our techniques, we develop the Safe Phased-Elimination (SafePE) algorithm, which achieves an optimal regret bound with only a logarithmic number of policy updates.
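To make the safety-constrained exploration trade-off described above concrete, the following is a minimal illustrative sketch, not the paper's water-filling construction: a small linear program that maximizes the minimum probability assigned to any action while keeping the logging policy's estimated expected reward within a factor (1 - alpha) of the baseline production policy's. All symbols used here (mu_hat, pi0, alpha) are assumed inputs introduced only for this example; in particular, the sketch uses reward estimates, unlike the no-side-information setting treated by the paper's water-filling design.

```python
# Illustrative sketch only (not the paper's algorithm): safety-constrained
# exploration as a linear program. Assumed inputs: mu_hat (reward estimates),
# pi0 (baseline production policy), alpha (allowed relative reward loss).
import numpy as np
from scipy.optimize import linprog


def safe_exploration_policy(mu_hat, pi0, alpha):
    """Maximize the minimum action probability subject to a reward constraint.

    Variables: x = (pi_1, ..., pi_K, t), where t lower-bounds every pi_a.
    Objective: maximize t (i.e., minimize -t).
    Constraints: pi lies on the probability simplex, and
                 mu_hat @ pi >= (1 - alpha) * (mu_hat @ pi0).
    """
    K = len(mu_hat)
    c = np.zeros(K + 1)
    c[-1] = -1.0  # minimize -t

    # t - pi_a <= 0 for every action a
    A_ub = np.hstack([-np.eye(K), np.ones((K, 1))])
    b_ub = np.zeros(K)

    # reward constraint: -mu_hat @ pi <= -(1 - alpha) * baseline reward
    baseline_reward = float(mu_hat @ pi0)
    A_ub = np.vstack([A_ub, np.append(-mu_hat, 0.0)])
    b_ub = np.append(b_ub, -(1.0 - alpha) * baseline_reward)

    # simplex constraint: probabilities sum to one (t is unconstrained here)
    A_eq = np.append(np.ones(K), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])

    bounds = [(0.0, 1.0)] * (K + 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:K]


if __name__ == "__main__":
    mu_hat = np.array([0.8, 0.5, 0.3, 0.1])   # assumed reward estimates
    pi0 = np.array([0.7, 0.2, 0.08, 0.02])    # assumed production policy
    print(safe_exploration_policy(mu_hat, pi0, alpha=0.1))
```

With a tight reward constraint (small alpha), the solution stays close to the production policy; as alpha grows, probability mass is spread more evenly across actions, mirroring the information-versus-safety tension the abstract describes.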