We consider a problem in which M agents collaboratively interact with an instance of a stochastic K-armed contextual bandit, where K >> M. The agents' goal is to jointly minimize the cumulative regret, summed over all agents, over a time horizon T. We consider a setting in which the exact context is observed only after a delay: at the time of choosing an action, the agents do not know the context and have access only to a distribution over the set of contexts. Such situations arise in applications where the context must be predicted at decision time (e.g., weather forecasting or stock market prediction) and can be estimated once the reward is obtained. We propose an Upper Confidence Bound (UCB)-based distributed algorithm and prove regret and communication bounds for linearly parametrized reward functions. We validate the performance of our algorithm via numerical simulations on synthetic data and real-world MovieLens data.
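To make the setting concrete, the following is a minimal single-agent sketch of a UCB rule for a linear reward model when only a distribution over contexts is known at decision time. It is not the distributed algorithm proposed here: there is no communication between agents, and the names `expected_features`, `LinUCB`, `lam`, and `beta` are hypothetical. The sketch assumes a finite context set, a known feature map stored as an array `phi`, and that the estimator is updated with the features of the realized context once that context is observed.

```python
import numpy as np


def expected_features(phi, context_dist):
    """Average the per-context arm features under the context distribution.

    phi          : array of shape (C, K, d), features for each context and arm
    context_dist : array of shape (C,), probabilities over the C contexts
    returns      : array of shape (K, d), one expected feature vector per arm
    """
    return np.tensordot(context_dist, phi, axes=1)


class LinUCB:
    """Ridge-regression estimate of a linear reward model with a UCB bonus."""

    def __init__(self, d, lam=1.0, beta=1.0):
        self.V = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)       # sum of reward-weighted features
        self.beta = beta           # exploration weight (assumed constant here)

    def select(self, feats):
        """Pick the arm with the largest upper confidence bound.

        feats : array of shape (K, d), e.g. the output of expected_features
        """
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b
        # Per-arm confidence width sqrt(x^T V^{-1} x).
        bonus = np.sqrt(np.einsum("kd,de,ke->k", feats, V_inv, feats))
        return int(np.argmax(feats @ theta_hat + self.beta * bonus))

    def update(self, x, reward):
        """Update with feature vector x of the played arm.

        Assumption: x is computed from the realized context, which becomes
        available only after the delay.
        """
        self.V += np.outer(x, x)
        self.b += reward * x
```

In a round, an agent would call `select(expected_features(phi, context_dist))` to choose an arm, and later call `update(phi[c, arm], reward)` once the realized context index `c` is revealed.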