The linear bandit problem has been studied for many years in both stochastic and adversarial settings. Designing an algorithm that performs well without knowing the type of losses it faces has attracted considerable interest. \citet{LeeLWZ021} propose an algorithm that actively detects the loss type and then switches between algorithms specially designed for each setting. However, such detect-switch approaches require meticulous design to perform well in all settings. Follow-the-regularized-leader (FTRL) is another popular algorithmic framework that can adapt to different environments. FTRL algorithms are simple in design, and in traditional multi-armed bandit problems their regret bounds are shown to be optimal compared with detect-switch type algorithms. Designing an FTRL-type algorithm for linear bandits has been a long-standing open problem. In this paper, we prove that an FTRL-type algorithm with a negative-entropy regularizer achieves best-of-three-worlds results for the linear bandit problem, through the tacit cooperation between the choice of the learning rate and a specially designed self-bounding inequality.
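For concreteness, the following is a minimal sketch of the FTRL update referenced above, written over the probability simplex with the standard negative-entropy regularizer; the symbols $w_t$, $\hat{\ell}_s$, and $\eta_t$ are illustrative notation, not definitions fixed by this paper:
\[
w_{t+1} = \operatorname*{arg\,min}_{w \in \Delta_K} \; \Big\langle w, \sum_{s=1}^{t} \hat{\ell}_s \Big\rangle + \psi_t(w),
\qquad
\psi_t(w) = \frac{1}{\eta_t} \sum_{i=1}^{K} w_i \ln w_i,
\]
where $\Delta_K$ denotes the simplex over the $K$ arms, $\hat{\ell}_s$ is a loss estimate constructed from bandit feedback, and $\eta_t$ is the learning rate. Since $\sum_i w_i \ln w_i$ is the negative entropy, a smaller $\eta_t$ strengthens the regularization and keeps $w_{t+1}$ closer to the uniform distribution.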