Confidence intervals are a crucial building block in the analysis of various online learning problems. The analysis of kernel-based bandit and reinforcement learning problems utilizes confidence intervals applicable to the elements of a reproducing kernel Hilbert space (RKHS). However, the existing confidence bounds do not appear to be tight, resulting in suboptimal regret bounds. In fact, the existing regret bounds for several kernelized bandit algorithms (e.g., GP-UCB, GP-TS, and their variants) may fail to even be sublinear. It is unclear whether the suboptimal regret bound is a fundamental shortcoming of these algorithms or an artifact of the proof; the main challenge seems to stem from the online (sequential) nature of the observation points. We formalize the question of online confidence intervals in the RKHS setting and survey the existing results.
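To make the object under discussion concrete, here is a minimal sketch (in Python, not from the paper) of the kind of confidence interval GP-UCB-style algorithms rely on: a Gaussian-process posterior mean mu_t(x) with an interval of width beta_t * sigma_t(x) around it. The RBF kernel, the noise level, and the fixed `beta` multiplier below are illustrative assumptions; choosing a beta_t that keeps the interval simultaneously valid and tight for RKHS elements under sequentially chosen observation points is precisely the open question raised above.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * sq / lengthscale**2)

def gp_confidence_interval(X, y, x_query, beta=2.0, noise=0.1):
    """Posterior mean mu(x) and the interval mu(x) +/- beta * sigma(x).

    X: (n, d) observation points, y: (n,) noisy observations,
    x_query: (m, d) points at which to evaluate the interval.
    The constant beta is a hypothetical placeholder for beta_t.
    """
    K = rbf_kernel(X, X) + noise**2 * np.eye(len(X))
    k_star = rbf_kernel(X, x_query)                    # (n, m) cross-covariances
    mu = k_star.T @ np.linalg.solve(K, y)              # posterior mean at x_query
    v = np.linalg.solve(K, k_star)
    var = rbf_kernel(x_query, x_query).diagonal() - np.sum(k_star * v, 0)
    sigma = np.sqrt(np.maximum(var, 0.0))              # posterior standard deviation
    return mu - beta * sigma, mu + beta * sigma

# Illustrative usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (20, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(20)
lo, hi = gp_confidence_interval(X, y, np.linspace(-1, 1, 5)[:, None])
```

In the standard frequentist analyses, the width multiplier beta_t is not a constant but grows with the maximal information gain of the kernel, and it is this growth that can make the resulting regret bounds fail to be sublinear for some kernels.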