We study the problem of distributed stochastic multi-armed contextual bandits with unknown contexts, in which M agents work collaboratively under the coordination of a central server to choose optimal actions and minimize the total regret. In our model, an adversary chooses a distribution over the set of possible contexts, and the agents observe only this context distribution; the exact context remains unknown to them. Such a situation arises, for instance, when the context itself is a noisy measurement or comes from a prediction mechanism, as in weather forecasting or stock market prediction. Our goal is to develop a distributed algorithm that selects a sequence of optimal actions to maximize the cumulative reward. By performing a feature vector transformation and leveraging the UCB framework, we propose a UCB-based algorithm for stochastic bandits with context distributions and prove that, for linearly parametrized reward functions, it achieves regret and communication bounds of $O(d\sqrt{MT}\log^2 T)$ and $O(M^{1.5} d^3)$, respectively. We also consider the case where the agents observe the actual context after choosing an action; for this setting we present a modified algorithm that exploits the additional information to achieve a tighter regret bound. Finally, we validate the performance of our algorithms and compare them with other baseline approaches through extensive simulations on synthetic data and on the real-world MovieLens dataset.
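To make the feature vector transformation concrete, below is a minimal single-agent sketch of the underlying idea: since the exact context $c_t$ is unobserved, the learner replaces the feature vector $\phi(c_t, a)$ with its expectation $\psi_t(a) = \mathbb{E}_{c \sim \mu_t}[\phi(c, a)]$ under the announced context distribution $\mu_t$, and runs standard LinUCB on these expected features. The discrete-context setup and all names (phi, beta, lam) are illustrative assumptions, not the paper's actual implementation, which additionally handles multi-agent communication with the server.

```python
# Minimal sketch: LinUCB on expected feature vectors under a context
# distribution. Assumed setup, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
d, K, n_contexts, T = 5, 4, 3, 2000   # feature dim, arms, contexts, horizon
lam, beta = 1.0, 1.0                  # ridge parameter and exploration width

phi = rng.normal(size=(n_contexts, K, d))   # feature map phi(c, a)
theta_star = rng.normal(size=d)             # unknown reward parameter
theta_star /= np.linalg.norm(theta_star)

A = lam * np.eye(d)   # ridge-regression Gram matrix
b = np.zeros(d)

for t in range(T):
    mu = rng.dirichlet(np.ones(n_contexts))  # adversary's context distribution
    c = rng.choice(n_contexts, p=mu)         # true context, hidden from learner

    # Expected features: psi[a] = E_{c ~ mu}[phi(c, a)]
    psi = (mu @ phi.reshape(n_contexts, -1)).reshape(K, d)

    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    ucb = psi @ theta_hat + beta * np.sqrt(
        np.einsum('ad,dk,ak->a', psi, A_inv, psi))
    a = int(np.argmax(ucb))

    # Reward depends on the *true* context; the learner only saw mu.
    r = phi[c, a] @ theta_star + 0.1 * rng.normal()
    A += np.outer(psi[a], psi[a])
    b += r * psi[a]
```

In the variant where the actual context is revealed after the action is chosen, the update step would use the realized features phi[c, a] instead of psi[a], which is the extra information the modified algorithm exploits for its tighter regret bound.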