We study a general class of contextual bandits, where each context-action pair is associated with a raw feature vector, but the reward generating function is unknown. We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network (deep representation learning), and uses an upper confidence bound (UCB) approach to explore in the last linear layer (shallow exploration). We prove that under standard assumptions, our proposed algorithm achieves $\tilde{O}(\sqrt{T})$ finite-time regret, where $T$ is the learning time horizon. Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.
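To make the "deep representation, shallow exploration" idea concrete, below is a minimal sketch of the two components described above: a feature map given by the last hidden layer of a ReLU network, and a LinUCB-style rule that explores only in the final linear layer. All dimensions, hyperparameters, and the reward simulation are illustrative assumptions, and the network weights are held fixed here for brevity, whereas the actual algorithm learns them from observed rewards; this is not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): raw feature dim d, hidden width m.
d, m, n_actions, T = 8, 32, 5, 200
lam, alpha = 1.0, 1.0            # ridge parameter and UCB exploration width

# Stand-in for a deep ReLU network; in the actual algorithm its weights are
# trained on observed rewards, here they are fixed random weights for brevity.
W1 = rng.normal(size=(m, d)) / np.sqrt(d)
W2 = rng.normal(size=(m, m)) / np.sqrt(m)

def representation(x):
    """Last hidden layer of the ReLU network: the learned feature map phi(x)."""
    h = np.maximum(W1 @ x, 0.0)
    return np.maximum(W2 @ h, 0.0)

# Linear UCB statistics over the learned representation (the "shallow" layer).
A = lam * np.eye(m)              # regularized design matrix
b = np.zeros(m)                  # reward-weighted feature sum

theta_true = rng.normal(size=m)  # hypothetical reward parameter, for simulation only

for t in range(T):
    contexts = rng.normal(size=(n_actions, d))   # one raw feature vector per action
    feats = np.array([representation(x) for x in contexts])

    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    # UCB score: estimated reward plus an exploration bonus in the last layer only.
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', feats, A_inv, feats))
    a = int(np.argmax(feats @ theta_hat + alpha * bonus))

    reward = feats[a] @ theta_true + 0.1 * rng.normal()  # simulated noisy reward
    A += np.outer(feats[a], feats[a])
    b += reward * feats[a]
```

Because the confidence set is maintained only over the m-dimensional last-layer features rather than over all network parameters, each round costs a single m x m linear solve, which is the source of the computational savings claimed above.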