The best-arm identification problem in the multi-armed bandit setting is an excellent model of many real-world decision-making problems, yet it fails to capture the fact that, in the real world, safety constraints must often be met while learning. In this work we study the question of best-arm identification in safety-critical settings, where the goal of the agent is to find the best safe option out of many, while exploring in a way that guarantees that certain, initially unknown safety constraints are satisfied. We first analyze this problem in the setting where the reward and the safety constraint both have a linear structure, and show nearly matching upper and lower bounds. We then analyze a much more general version of the problem in which we only assume that the reward and the safety constraint can be modeled by monotonic functions, and propose an algorithm for this setting that is guaranteed to learn safely. We conclude with experimental results demonstrating the effectiveness of our approaches in scenarios such as safely identifying the best drug among many candidates for treating an illness.