We consider a Multi-Armed Bandit problem in which the rewards are non-stationary and depend on past actions and, potentially, on past contexts. At the core of our method, we employ a recurrent neural network to model these sequences. To balance exploration and exploitation, we introduce an energy minimization term that prevents the neural network from becoming overly confident in support of a certain action. This term provably limits the gap between the maximal and minimal probabilities assigned by the network. In a diverse set of experiments, we demonstrate that our method is at least as effective as methods proposed for the sub-problem of Rotting Bandits, and can solve intuitive extensions of various benchmark problems. We share our implementation at https://github.com/rotmanmi/Energy-Regularized-RNN.
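To make the idea concrete, the following is a minimal sketch of the kind of setup described above: a recurrent network maps the interaction history to action probabilities, and an added energy-style penalty discourages over-confident outputs. This is an assumed PyTorch-style illustration; the class and function names and the exact form of the penalty are illustrative assumptions, not the paper's formulation (see the linked repository for the actual implementation).

```python
import torch
import torch.nn as nn


class RecurrentBanditPolicy(nn.Module):
    """GRU over the interaction history, producing a distribution over arms."""

    def __init__(self, input_dim: int, hidden_dim: int, n_arms: int):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_arms)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, time, input_dim), encoding past contexts/actions/rewards
        out, _ = self.rnn(history)
        logits = self.head(out[:, -1])        # use the last hidden state
        return torch.softmax(logits, dim=-1)  # probabilities over arms


def energy_penalty(probs: torch.Tensor) -> torch.Tensor:
    # Illustrative "energy" term: the sum of squared probabilities is minimized
    # by the uniform distribution, so adding it to the loss keeps the policy
    # from concentrating all of its mass on a single arm.
    return (probs ** 2).sum(dim=-1).mean()


def regularized_loss(probs: torch.Tensor,
                     rewards: torch.Tensor,
                     actions: torch.Tensor,
                     lam: float = 0.1) -> torch.Tensor:
    # Simple policy-gradient-style objective plus the energy regularizer.
    chosen = probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(torch.log(chosen + 1e-8) * rewards).mean()
    return pg_loss + lam * energy_penalty(probs)
```

The weight `lam` trades off reward maximization against the penalty: a larger value pushes the action distribution toward uniform (more exploration), while a smaller one lets the network commit more strongly to the currently best-looking arm.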