We consider the neural contextual bandit problem. In contrast to existing work, which primarily focuses on ReLU neural nets, we consider a general class of smooth activation functions. Under this more general setting, (i) we derive non-asymptotic error bounds on the difference between an overparameterized neural net and its corresponding neural tangent kernel, and (ii) we propose an algorithm with a provably sublinear regret bound that is also efficient in the finite regime, as demonstrated by empirical studies. The non-asymptotic error bounds may be of broader interest as a tool for establishing the relation between the smoothness of the activation functions in neural contextual bandits and the smoothness of the kernels in kernel bandits.
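To make the net-to-NTK correspondence referenced above concrete, here is a minimal, self-contained sketch (not the paper's construction, and it does not reproduce the paper's bounds): it uses tanh as an example smooth activation and numerically checks that the finite-width tangent kernel of a simple two-layer net concentrates as the width m grows, which is the phenomenon the non-asymptotic error bounds quantify. The architecture, NTK-style parameterization, and test inputs are all assumptions made for this illustration.

```python
# Sketch: empirical tangent kernel of a width-m two-layer tanh net,
# evaluated at increasing widths to illustrate concentration toward
# a fixed (infinite-width NTK) limit.
import jax
import jax.numpy as jnp

def init_params(key, m, d):
    kw, ka = jax.random.split(key)
    W = jax.random.normal(kw, (m, d))   # first-layer weights
    a = jax.random.normal(ka, (m,))     # output-layer weights
    return W, a

def f(params, x):
    # NTK parameterization: output scaled by 1/sqrt(m)
    W, a = params
    m = a.shape[0]
    return a @ jnp.tanh(W @ x) / jnp.sqrt(m)

def empirical_ntk(params, x1, x2):
    # <grad_theta f(x1), grad_theta f(x2)> over all parameters
    g1 = jax.grad(f)(params, x1)
    g2 = jax.grad(f)(params, x2)
    return sum(jnp.vdot(u, v)
               for u, v in zip(jax.tree_util.tree_leaves(g1),
                               jax.tree_util.tree_leaves(g2)))

key = jax.random.PRNGKey(0)
d = 4
x1 = jnp.ones(d) / jnp.sqrt(d)
x2 = jnp.arange(d, dtype=jnp.float32)
x2 = x2 / jnp.linalg.norm(x2)
for m in (100, 10_000, 1_000_000):
    k = empirical_ntk(init_params(key, m, d), x1, x2)
    print(m, float(k))  # fluctuations shrink (O(1/sqrt(m))) as m grows
```

The printed kernel values stabilize as m increases; the paper's contribution, as stated in the abstract, is to bound this finite-width deviation non-asymptotically for general smooth activations rather than observe it empirically.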