In this paper, we study the well-known stochastic linear bandit problem, where a decision-maker sequentially chooses among a set of given actions, observes their noisy rewards, and aims to maximize her cumulative expected reward over a horizon of length $T$. We first introduce a general analysis framework and a family of rate-optimal algorithms for this problem. We show that this family includes well-known algorithms, such as optimism in the face of uncertainty linear bandit (OFUL) and Thompson sampling (TS), as special cases. The proposed analysis technique directly captures the complexity of uncertainty in the action sets, which we show is tied to the regret analysis of any policy. This insight allows us to design a new rate-optimal policy, called Sieved-Greedy (SG), that reduces the over-exploration problem in existing algorithms. SG uses the observed data to discard actions with relatively low uncertainty and then chooses greedily among the remaining actions. In addition to proving that SG is rate-optimal, our empirical simulations show that it significantly outperforms existing benchmarks such as greedy, OFUL, and TS. Moreover, our analysis technique yields a number of new results, such as poly-logarithmic (in $T$) regret bounds for OFUL and TS under a generalized gap assumption and a margin condition, as in the literature on contextual bandits. We also improve the regret bounds of these algorithms for the sub-class of $k$-armed contextual bandit problems by a factor of $\sqrt{k}$.
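To make the Sieved-Greedy idea concrete, the following is a minimal sketch of a single decision step, based only on the abstract's description (discard actions with relatively low uncertainty, then act greedily among the rest). The choice of uncertainty measure (the confidence width $\|x\|_{V_t^{-1}}$), the sieving threshold `sieve_frac`, and all variable names are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def sieved_greedy_action(actions, theta_hat, V_inv, sieve_frac=0.5):
    """Illustrative Sieved-Greedy step (sketch; the paper's exact rule may differ).

    actions    : (K, d) array of available action feature vectors
    theta_hat  : (d,) ridge estimate of the unknown reward parameter
    V_inv      : (d, d) inverse of the regularized Gram matrix V_t
    sieve_frac : hypothetical threshold -- keep actions whose uncertainty is at
                 least this fraction of the largest uncertainty
    """
    # Uncertainty of each action: its confidence width ||x||_{V_t^{-1}}
    widths = np.sqrt(np.einsum('kd,de,ke->k', actions, V_inv, actions))

    # Sieve: discard actions with relatively low uncertainty
    keep = widths >= sieve_frac * widths.max()

    # Greedy step: maximize the estimated reward among the surviving actions
    est_rewards = actions @ theta_hat
    est_rewards[~keep] = -np.inf
    return int(np.argmax(est_rewards))
```

In this sketch, setting `sieve_frac = 0` recovers the purely greedy policy, while larger values restrict the greedy choice to actions that still carry enough estimation uncertainty, which is one plausible way to read the abstract's claim that SG curbs over-exploration without abandoning optimistic exploration entirely.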