线性背景强盗和反反贪污腐败的近最佳最佳算法 (Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions)

from arxiv, 25 pages, 1 table. This version simplifies the proof of the regret upper bound in Version 1, and provides a stronger result for the lower bound

We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms in this setting are limited in that they either are computationally inefficient or require a strong assumption on the corruption, or their regret is at least $C$ times worse than the regret without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face of uncertainty. At the core of our algorithm is a weighted ridge regression where the weight of each chosen action depends on its confidence up to some threshold. We show that for both known $C$ and unknown $C$ cases, our algorithm with proper choice of hyperparameter achieves a regret that nearly matches the lower bounds. Thus, our algorithm is nearly optimal up to logarithmic factors for both cases. Notably, our algorithm achieves the near-optimal regret for both corrupted and uncorrupted cases ($C=0$) simultaneously.

翻译：在对抗性腐败的情况下,我们研究了线性背景土匪问题,每回合的奖赏被对手腐蚀,而腐败水平(即地平线上的腐败程度之和)为$C\geq 0美元。在这一背景下,最著名的算法是有限的,因为它们在计算上效率低下,或者要求对腐败作出强有力的假设,或者他们的遗憾至少比没有腐败的遗憾高出一倍。在本文件中,为了克服这些限制,我们提议了一种新的算法,其依据是在不确定性面前乐观的原则。在我们的算法的核心是加权的脊椎回归,其中所选择的每个行动的权重取决于其信心达到某种临界点。我们表明,对于已知的美元和未知的美元案例,我们适当选择超参数的算法都取得了几乎与较低界限相匹配的遗憾。因此,我们的算法几乎是最佳的对立因素。值得注意的是,我们的算法同时对腐败和未受干扰的案件都取得了接近最佳的遗憾。