In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as that of a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB), and conservative contextual combinatorial bandits (CCCB). Unlike previous works, which consider high-probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward and achieve better theoretical guarantees ($T$-independent additive regrets instead of $T$-dependent ones) as well as better empirical performance. Furthermore, we extend the results to a novel conservative mean-variance bandit problem (MV-CBP), which measures learning performance by both the expected reward and its variability. For this extended problem, we provide a novel algorithm with $O(1/T)$ normalized additive regret ($T$-independent in the cumulative form) and validate this result through empirical evaluation.
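To make the sample-path constraint concrete, the display below is a minimal sketch of a constraint of this kind; the conservative level $\alpha$ and the baseline-reward notation $r^0_t$ are illustrative assumptions and not taken from the source.
$$\sum_{t=1}^{\tau} r_t \;\ge\; (1-\alpha)\sum_{t=1}^{\tau} r^0_t \qquad \text{for every round } \tau \in \{1,\dots,T\},$$
where $r_t$ is the reward actually received by the learner at round $t$ and $r^0_t$ is the reward the baseline would receive. The inequality is required to hold on every realized sample path, rather than merely in expectation or with high probability.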