The multi-armed bandit (MAB) problem requires making sequential decisions while balancing exploration and exploitation, and has been successfully applied to a wide range of practical scenarios. Various algorithms have been designed to achieve high reward in the long term. However, their short-term performance can be rather poor, which is harmful in risk-sensitive applications. Building on previous work on conservative bandits, we propose a framework of contextual combinatorial conservative bandits. We present an algorithm and prove a regret bound of $\tilde O(d^2+d\sqrt{T})$, where $d$ is the dimension of the feature vectors and $T$ is the total number of time steps. We further provide an algorithm, together with a regret analysis, for the case where the conservative reward is unknown. Experiments are conducted, and the results validate the effectiveness of our algorithms.