Recent works on neural contextual bandits have achieved compelling performance due to their ability to leverage the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, giving rise to the setting of federated contextual bandits. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. To this end, this paper introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: $\text{UCB}^{a}$ allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while $\text{UCB}^{b}$ uses an NN with aggregated parameters for reward prediction, in a manner similar to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis admits an interesting interpretation: it emphasizes $\text{UCB}^{a}$ initially for accelerated exploration and relies more on $\text{UCB}^{b}$ later, after enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.
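The weighted combination of the two UCBs can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's actual FN-UCB: the function `combined_ucb`, the transition scale `t0`, and the specific weighting schedule are all illustrative assumptions; the only idea taken from the abstract is that the weight shifts from $\text{UCB}^{a}$ (pooled-observation exploration) toward $\text{UCB}^{b}$ (aggregated-NN prediction) as more observations accumulate.

```python
import numpy as np

def combined_ucb(ucb_a, ucb_b, t, t0=50.0):
    """Weighted combination of two UCB scores (illustrative only).

    Emphasizes ucb_a (exploration via pooled agent statistics) early
    and ucb_b (aggregated-NN reward prediction) later. `t0` is an
    assumed transition scale, not a quantity from the paper.
    """
    w = min(1.0, t / t0)            # weight on UCB^b grows with round t
    return (1.0 - w) * ucb_a + w * ucb_b

# Toy per-arm optimism scores for three arms.
ucb_a = np.array([1.2, 0.8, 1.0])   # from other agents' observations
ucb_b = np.array([0.9, 1.1, 0.7])   # from the NN with aggregated parameters

early = combined_ucb(ucb_a, ucb_b, t=5)    # dominated by UCB^a
late = combined_ucb(ucb_a, ucb_b, t=500)   # dominated by UCB^b
print(int(np.argmax(early)), int(np.argmax(late)))  # different arms win
```

Early on the arm favored by $\text{UCB}^{a}$ is selected; once the weight has shifted, the arm favored by the NN-based $\text{UCB}^{b}$ takes over.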