We study bandits and reinforcement learning (RL) subject to a conservative constraint, in which the agent is required to perform at least as well as a given baseline policy. This setting is particularly relevant in real-world domains such as digital marketing, healthcare, production, and finance. For multi-armed bandits, linear bandits, and tabular RL, specialized algorithms and theoretical analyses were proposed in previous work. In this paper, we present a unified framework for conservative bandits and RL, whose core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, our framework gives a black-box reduction that turns a certain lower bound in the nonconservative setting into a new lower bound in the conservative setting. We strengthen the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL, and low-rank MDPs. For upper bounds, our framework turns a certain nonconservative upper-confidence-bound (UCB) algorithm into a conservative algorithm with a simple analysis. For multi-armed bandits, linear bandits, and tabular RL, our new upper bounds tighten or match existing ones with significantly simpler analyses. We also obtain a new upper bound for conservative low-rank MDPs.
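For concreteness, one common formalization of the conservative constraint in this literature (not spelled out in the abstract itself; the symbols $\alpha$ and $\pi_0$ are introduced here only for illustration) requires the learner's cumulative reward to stay within a $(1-\alpha)$ factor of the baseline's at every round:
\[
\sum_{s=1}^{t} r_s \;\ge\; (1-\alpha) \sum_{s=1}^{t} r_s^{\pi_0} \qquad \text{for all } t = 1, \dots, T,
\]
where $\pi_0$ denotes the baseline policy and $\alpha \in (0,1)$ controls how much short-term loss relative to the baseline is tolerated.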