We study the problem of federated stochastic multi-armed contextual bandits with unknown contexts, in which M agents face different bandit instances and collaborate to learn. The communication model consists of a central server; the agents periodically share their estimates with the server in order to learn to choose optimal actions and minimize the total regret. We assume that the exact contexts are not observable and that each agent observes only a distribution over the contexts. Such a situation arises, for instance, when the context itself is a noisy measurement or the output of a prediction mechanism. Our goal is to develop a distributed, federated algorithm that facilitates collaborative learning among the agents so that they select a sequence of optimal actions maximizing the cumulative reward. By performing a feature-vector transformation, we propose an elimination-based algorithm and prove a regret bound for linearly parametrized reward functions. Finally, we validate the performance of our algorithm and compare it with a baseline approach via numerical simulations on synthetic data and on the real-world MovieLens dataset.
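As a hedged illustration of the feature-vector transformation mentioned above (the exact construction appears in the body of the paper; the symbols below, including the context distribution $p_t$, the feature map $\phi$, and the parameter $\theta^{*}$, are our own notation and are assumptions), one natural way to handle unobserved contexts is to replace the per-context features by their expectation under the observed context distribution. If the expected reward of arm $a$ under context $c$ is linear, $\mathbb{E}[r_{t,a} \mid c_t = c] = \langle \theta^{*}, \phi(c, a) \rangle$, and the learner observes only $p_t$ rather than $c_t$, then with
\[
  \psi_t(a) \;=\; \mathbb{E}_{c \sim p_t}\bigl[\phi(c, a)\bigr],
  \qquad
  \mathbb{E}\bigl[r_{t,a}\bigr] \;=\; \bigl\langle \theta^{*}, \psi_t(a) \bigr\rangle,
\]
the expected reward remains linear in the same unknown parameter, so standard linear-bandit machinery (such as elimination-based confidence sets) can be applied to the transformed features $\psi_t(a)$.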