We study the survival bandit problem, a variant of the multi-armed bandit problem in which the cumulative reward is constrained: at each time step, the agent receives a reward in [-1, 1], and if the cumulative reward falls below a preset threshold, the procedure stops; this event is called ruin. To our knowledge, this is the first paper to study a framework in which ruin may occur, but need not. We first show that no policy can achieve sublinear regret as defined in the standard multi-armed bandit problem, because a single pull of an arm can significantly increase the risk of ruin. Instead, we establish the framework of Pareto-optimal policies: a class of policies whose cumulative reward on one instance cannot be improved without sacrificing it on another. To this end, we provide tight lower bounds on the probability of ruin, together with matching policies called EXPLOIT. Finally, by applying a doubling trick to an EXPLOIT policy, we exhibit a Pareto-optimal policy in the case of {-1, 0, 1} rewards, answering the open problem posed by Perotto et al. (2019).
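For concreteness, one way to formalize the ruin event is as a stopping time; the notation below (initial budget $B$, rewards $X_s$, stopping time $\tau$) is illustrative and not fixed by the abstract:

\[
\tau \;=\; \inf\Bigl\{\, t \ge 1 \;:\; B + \sum_{s=1}^{t} X_s < 0 \,\Bigr\}, \qquad X_s \in [-1, 1],
\]

where $B > 0$ is the initial budget (the preset threshold shifted to zero), $X_s$ is the reward received at time $s$ from the arm pulled at that step, and ruin corresponds to the event $\{\tau < \infty\}$, i.e., the procedure stops at the finite time $\tau$.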