The safe linear bandit problem (SLB) is an online approach to linear programming with an unknown objective and unknown round-wise constraints, under stochastic bandit feedback on the rewards and safety risks of actions. We study aggressive \emph{doubly-optimistic play} in SLBs, and its role in avoiding the strong assumptions and poor efficacy associated with extant pessimistic-optimistic solutions. We first elucidate an inherent hardness in SLBs arising from the lack of knowledge of the constraints: there exist `easy' instances, in which suboptimal extreme points have large `gaps', on which SLB methods must nevertheless incur $\Omega(\sqrt{T})$ regret and safety violations, owing to an inability to refine the location of the optimal actions to arbitrary precision. In a positive direction, we propose and analyse a doubly-optimistic confidence-bound-based strategy for the safe linear bandit problem, DOSLB, which exploits supreme optimism by selecting actions using optimistic estimates of both rewards and safety risks. Using a novel dual analysis, we show that despite the lack of knowledge of the constraints, DOSLB rarely takes overly risky actions, and we obtain tight instance-dependent $O(\log^2 T)$ bounds on both efficacy regret and net safety violation up to any finite precision, thus yielding large efficacy gains at a small safety cost and without strong assumptions. Concretely, we argue that the algorithm activates noisy versions of an `optimal' set of constraints at each round, and that activation of suboptimal sets of constraints is limited by the larger of the safety and efficacy gaps we define.
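As an illustrative sketch of the doubly-optimistic selection rule (the notation here is assumed rather than taken from the paper: $\theta^\star$ denotes the unknown reward parameter, $a_i$ the unknown constraint vectors with thresholds $b_i$, and $\mathcal{C}^\theta_t$, $\mathcal{C}^i_t$ the round-$t$ confidence sets for these quantities), a DOSLB-style play at round $t$ would solve
\[
x_t \in \operatorname*{arg\,max}_{x \in \mathcal{X}} \; \max_{\tilde\theta \in \mathcal{C}^\theta_t} \langle \tilde\theta, x \rangle \quad \text{subject to} \quad \min_{\tilde a_i \in \mathcal{C}^i_t} \langle \tilde a_i, x \rangle \le b_i \;\; \forall i,
\]
that is, both the reward and the safety risks of each action are evaluated at their most favourable plausible values, so the resulting linear program is always at least as permissive as the true (unknown) one.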