Contextual bandits aim to identify, among a set of arms, the optimal one with the highest reward based on the arms' contextual information. Motivated by the fact that arms usually exhibit group behaviors and that mutual impacts exist among groups, we introduce a new model, the Arm Group Graph (AGG), where nodes represent groups of arms and weighted edges formulate the correlations among groups. To leverage the rich information in AGG, we propose a bandit algorithm, AGG-UCB, in which neural networks are designed to estimate rewards and graph neural networks (GNNs) are used to learn representations of arm groups that capture their correlations. To address the exploitation-exploration dilemma in bandits, we derive a new upper confidence bound (UCB) for exploration, built on the neural networks used for exploitation. Furthermore, we prove that AGG-UCB can achieve a near-optimal regret bound with over-parameterized neural networks, and we provide a convergence analysis of GNNs with fully-connected layers, which may be of independent interest. Finally, we conduct extensive experiments against state-of-the-art baselines on multiple public datasets, demonstrating the effectiveness of the proposed algorithm.
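The abstract describes two mechanisms only at a high level: propagating information among arm groups over a weighted graph, and choosing arms by an upper confidence bound. The sketch below is not AGG-UCB. It swaps the paper's neural reward estimator and GNN for a parameter-free, one-hop weighted-mean aggregation and a linear (LinUCB-style) ridge model, purely to illustrate how a group-graph embedding can feed a UCB score. All names (`aggregate`, `arm_features`, `LinUCB`) and any inputs are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def aggregate(group_feats: Matrix, adj: Matrix) -> Matrix:
    """Hypothetical stand-in for a GNN layer: one-hop weighted mean over
    neighbouring groups plus a self-loop, with no learned weights."""
    n = len(group_feats)
    out = []
    for i in range(n):
        w = list(adj[i])  # copy so the caller's adjacency is not mutated
        w[i] += 1.0       # self-loop so a group keeps its own signal
        total = sum(w)
        out.append([sum(w[j] * group_feats[j][k] for j in range(n)) / total
                    for k in range(len(group_feats[0]))])
    return out


def arm_features(contexts: Matrix, groups: List[int], group_emb: Matrix) -> Matrix:
    """Concatenate each arm's context with the embedding of its group."""
    return [ctx + group_emb[g] for ctx, g in zip(contexts, groups)]


class LinUCB:
    """Linear UCB with ridge regression (a swapped-in technique, not the
    paper's neural UCB); A^{-1} is maintained via Sherman-Morrison."""

    def __init__(self, dim: int, alpha: float = 1.0, lam: float = 1.0):
        self.alpha = alpha
        self.a_inv = [[1.0 / lam if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def score(self, x: Vector) -> float:
        theta = matvec(self.a_inv, self.b)  # ridge estimate (exploitation)
        width = math.sqrt(max(dot(x, matvec(self.a_inv, x)), 0.0))  # exploration
        return dot(theta, x) + self.alpha * width

    def select(self, arms: Matrix) -> int:
        scores = [self.score(x) for x in arms]
        return max(range(len(arms)), key=scores.__getitem__)  # first max on ties

    def update(self, x: Vector, reward: float) -> None:
        # (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x)
        ax = matvec(self.a_inv, x)
        denom = 1.0 + dot(x, ax)
        d = len(x)
        for i in range(d):
            for j in range(d):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(d):
            self.b[i] += reward * x[i]
```

In use, one would compute `group_emb = aggregate(group_feats, adj)` once per round, build `arms = arm_features(contexts, groups, group_emb)`, pick `i = agent.select(arms)`, and then call `agent.update(arms[i], reward)`. The linear model keeps the example small and exact; the paper instead uses over-parameterized neural networks, whose confidence width is not captured here.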