This paper aims to put forward the concept that learning to take safe actions in unknown environments, even with probability-one guarantees, can be achieved without an unbounded number of exploratory trials, provided that one is willing to mildly relax one's optimality requirements. We focus on the canonical multi-armed bandit problem and seek to study the exploration-preservation trade-off intrinsic to safe learning. More precisely, by defining a handicap metric that counts the number of unsafe actions, we provide an algorithm for discarding unsafe machines (or actions), with probability one, that achieves constant handicap. Our algorithm is rooted in the classical sequential probability ratio test, redefined here for continuing tasks. Under standard assumptions on sufficient exploration, our rule provably detects all unsafe machines in an (expected) finite number of rounds. The analysis also unveils a trade-off between the number of rounds needed to secure the environment and the probability of discarding safe machines. Our decision rule can be wrapped around any other algorithm that optimizes a specific auxiliary goal, since it provides a safe environment in which to search for (approximately) optimal policies. Simulations corroborate our theoretical findings and further illustrate the aforementioned trade-offs.
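As a rough illustration of the kind of wrapper described above, the sketch below shows a one-sided sequential probability ratio test that discards an arm once the evidence of a safety violation crosses a threshold. It is a minimal sketch under assumed Bernoulli safety signals; the class name SPRTSafetyFilter and the parameters p0, p1, and alpha are hypothetical choices for illustration, not the paper's exact continuing-task rule.

```python
import numpy as np

# Illustrative sketch (not the paper's exact rule): a one-sided SPRT that wraps a
# base bandit policy and discards an arm once the evidence that its Bernoulli
# "violation" probability is p1 (unsafe) rather than p0 (safe) crosses a
# threshold set by a tolerated error probability alpha. All parameters here are
# assumptions made for the example.

class SPRTSafetyFilter:
    def __init__(self, n_arms, p0=0.05, p1=0.2, alpha=1e-3):
        self.llr = np.zeros(n_arms)                 # per-arm log-likelihood ratios
        self.threshold = np.log(1.0 / alpha)        # discard when llr >= threshold
        self.p0, self.p1 = p0, p1
        self.active = np.ones(n_arms, dtype=bool)   # arms not yet declared unsafe

    def update(self, arm, violation):
        """Update the test for `arm` after observing a binary safety violation."""
        if violation:
            self.llr[arm] += np.log(self.p1 / self.p0)
        else:
            self.llr[arm] += np.log((1 - self.p1) / (1 - self.p0))
        if self.llr[arm] >= self.threshold:
            self.active[arm] = False                # declare the arm unsafe and discard it

# Usage: restrict any base policy to the arms in `filt.active`, then feed the
# observed safety signal back through `filt.update`. Uniform exploration stands
# in for the base policy; the violation probabilities are assumed ground truth.
rng = np.random.default_rng(0)
true_violation_prob = np.array([0.02, 0.03, 0.30])  # arm 2 is unsafe
filt = SPRTSafetyFilter(n_arms=3)
for t in range(2000):
    candidates = np.flatnonzero(filt.active)
    arm = rng.choice(candidates)
    violation = rng.random() < true_violation_prob[arm]
    filt.update(arm, int(violation))
print("arms still considered safe:", np.flatnonzero(filt.active))
```

The threshold choice log(1/alpha) reflects the usual SPRT trade-off hinted at in the abstract: a smaller alpha lowers the probability of discarding a safe arm but increases the expected number of rounds before an unsafe arm is detected.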