We propose a \underline{d}oubly \underline{o}ptimistic strategy for the \underline{s}afe-\underline{l}inear-\underline{b}andit problem, DOSLB. The safe linear bandit problem is to optimise an unknown linear reward whilst satisfying unknown round-wise safety constraints on actions, using stochastic bandit feedback of the reward and safety risks of actions. In contrast to prior work on aggregated resource constraints, our formulation explicitly demands control on round-wise safety risks. Unlike existing optimistic-pessimistic paradigms for safe bandits, DOSLB exercises supreme optimism, using optimistic estimates of both reward and safety scores to select actions. Yet, and surprisingly, we show that DOSLB rarely takes risky actions, and obtains $\tilde{O}(d \sqrt{T})$ regret, where our notion of regret accounts for both inefficiency and lack of safety of actions. Specialising to polytopal domains, we first notably show that the $\sqrt{T}$-regret bound cannot be improved even with large gaps, and then identify a slackened notion of regret for which we show tight instance-dependent $O(\log^2 T)$ bounds. We further argue that in such domains, the number of times an overly risky action is played is also bounded as $O(\log^2 T)$.
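Informally, and purely as an illustrative sketch of the selection rule rather than a formal statement, the doubly optimistic choice can be written as follows, assuming a confidence set $\mathcal{C}^{\mathrm{rew}}_t$ for the unknown reward parameter, a confidence set $\mathcal{C}^{\mathrm{safe}}_t$ for the unknown safety parameter, a safety threshold $\alpha$, and action set $\mathcal{A}$ (all notation here is illustrative):
\[
  a_t \in \arg\max_{a \in \mathcal{A}} \; \max_{\tilde{\theta} \in \mathcal{C}^{\mathrm{rew}}_t} \langle \tilde{\theta}, a \rangle
  \quad \text{subject to} \quad
  \min_{\tilde{\mu} \in \mathcal{C}^{\mathrm{safe}}_t} \langle \tilde{\mu}, a \rangle \le \alpha.
\]
That is, each action is scored by its most favourable plausible reward and admitted if some plausible safety parameter deems it safe, in contrast to pessimistic schemes that admit an action only if every plausible safety parameter deems it safe.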