EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits

In this paper, we propose EE-Net, a novel neural exploration strategy for contextual bandits that is distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades and have a wide range of applications. Three main techniques address the exploitation-exploration trade-off in bandits: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combined it with TS or UCB for exploration. However, this line of work explicitly assumes that the reward is a linear function of the arm vectors, which may not hold for real-world datasets. To overcome this limitation, a series of neural bandit algorithms have been proposed, in which a neural network learns the underlying reward function and TS or UCB is adapted for exploration. Instead of calculating a large-deviation-based statistical bound for exploration as previous methods do, EE-Net learns the exploration signal itself. In addition to using a neural network (the Exploitation network) to learn the reward function, EE-Net uses a second neural network (the Exploration network) to adaptively learn the potential gain relative to the currently estimated reward. A decision-maker is then constructed to combine the outputs of the Exploitation and Exploration networks. We prove that EE-Net achieves $\mathcal{O}(\sqrt{T\log T})$ regret and show that it outperforms existing linear and neural contextual bandit baselines on real-world datasets.
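
To make the three-component structure concrete, the following is a minimal sketch of one way the idea could be wired together, not the authors' implementation. Several things here are assumptions made for illustration: the names `TinyMLP` and `run_ee_net`, the synthetic cosine reward, the plain SGD training, and the simple additive decision-maker. The abstract also does not say what the Exploration network takes as input, so this sketch feeds it the raw context vector.

```python
import numpy as np


class TinyMLP:
    """Hypothetical one-hidden-layer ReLU regressor trained by SGD on squared error."""

    def __init__(self, dim, hidden=32, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(dim), (hidden, dim))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, x):
        h = np.maximum(self.W1 @ x + self.b1, 0.0)
        return float(self.w2 @ h + self.b2)

    def update(self, x, target):
        # One SGD step on 0.5 * (prediction - target)^2.
        pre = self.W1 @ x + self.b1
        h = np.maximum(pre, 0.0)
        err = float(self.w2 @ h + self.b2) - target
        grad_pre = err * self.w2 * (pre > 0)  # backprop through the ReLU
        self.w2 -= self.lr * err * h
        self.b2 -= self.lr * err
        self.W1 -= self.lr * np.outer(grad_pre, x)
        self.b1 -= self.lr * grad_pre


def run_ee_net(T=2000, n_arms=4, dim=5, seed=0):
    """Run the sketch on a synthetic bandit and return the per-round regret."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=dim)
    theta /= np.linalg.norm(theta)
    f1 = TinyMLP(dim, seed=1)  # Exploitation network: estimates the reward
    f2 = TinyMLP(dim, seed=2)  # Exploration network: estimates the potential gain
    regret = np.zeros(T)
    for t in range(T):
        arms = rng.normal(size=(n_arms, dim))
        arms /= np.linalg.norm(arms, axis=1, keepdims=True)
        expected = np.cos(3.0 * arms @ theta)  # assumed nonlinear reward function
        # Decision-maker (assumed additive): estimated reward plus potential gain.
        scores = [f1.predict(x) + f2.predict(x) for x in arms]
        a = int(np.argmax(scores))
        x = arms[a]
        r = expected[a] + 0.1 * rng.normal()  # noisy observed reward
        # Exploration target: observed reward minus the current reward estimate.
        f2.update(x, r - f1.predict(x))
        f1.update(x, r)
        regret[t] = expected.max() - expected[a]
    return regret


if __name__ == "__main__":
    print("cumulative regret:", run_ee_net().sum())
```

The point of the sketch is the training target of the second network: it regresses on the gap between the observed reward and the first network's estimate, so an arm whose reward is currently underestimated receives a positive bonus without any confidence bound being computed.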
