The multi-player multi-armed bandit is an increasingly relevant decision-making problem, motivated by applications to cognitive radio systems. Most research on this problem focuses exclusively on settings in which players have \textit{full access} to all arms and receive no reward when pulling the same arm. Hence all players solve the same bandit problem with the goal of maximizing their cumulative reward. However, these settings neglect several important factors in many real-world applications, where players have only \textit{limited access} to \textit{a dynamic local subset of arms} (i.e., an arm could sometimes be ``walking'' and not accessible to the player). To this end, this paper proposes a \textit{multi-player multi-armed walking bandits} model, aiming to address the aforementioned modeling issues. The goal is still to maximize the cumulative reward; however, players can pull arms only from their local subsets and collect the full reward only if no other player pulls the same arm. We adopt an Upper Confidence Bound (UCB) approach to handle the exploration-exploitation tradeoff and employ distributed optimization techniques to properly handle collisions. By carefully integrating these two techniques, we propose a decentralized algorithm with a near-optimal regret guarantee that can be easily implemented to obtain competitive empirical performance.
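For concreteness, a minimal sketch of the per-arm index, assuming the classical UCB1 form (the paper's exact index and confidence radius may differ):
\[
\mathrm{UCB}_k(t) \;=\; \hat{\mu}_k(t) \;+\; \sqrt{\frac{2\ln t}{n_k(t)}},
\]
where $\hat{\mu}_k(t)$ is the empirical mean reward of arm $k$ and $n_k(t)$ is the number of times arm $k$ has been pulled up to round $t$. Under this sketch, each player would pull the arm with the highest index among its currently accessible local subset, with the distributed optimization step resolving collisions between players targeting the same arm.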