The multi-armed bandit (MAB) problem is a classic online learning framework that studies sequential decision-making in an uncertain environment. The MAB framework, however, overlooks the scenario where the decision-maker cannot take actions (e.g., pulling arms) directly. This scenario is practically important in many applications such as spectrum sharing, crowdsensing, and edge computing. In these applications, the decision-maker must incentivize other self-interested agents to carry out desired actions (i.e., pulling arms on the decision-maker's behalf). This paper establishes the incentivized online learning (IOL) framework for this scenario. The key challenge in designing the IOL framework lies in the tight coupling between learning the unknown environment and revealing asymmetric information. To address this, we construct a special Lagrangian function, based on which we propose a socially optimal mechanism for the IOL framework. Our mechanism satisfies various desirable properties such as agent fairness, incentive compatibility, and voluntary participation. It achieves the same asymptotic performance as the state-of-the-art benchmark that requires extra information. Our analysis also unveils the power of the crowd in the IOL framework: a larger agent crowd enables our mechanism to approach the theoretical upper bound of social performance more closely. Numerical results demonstrate the advantages of our mechanism in large-scale edge computing.
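To make the classic MAB setting referenced above concrete, the following is a minimal sketch of a standard UCB1 learner pulling arms directly; it is only an illustration of the baseline framework (all reward parameters are hypothetical), not the paper's IOL mechanism, which instead delegates arm pulls to incentivized, self-interested agents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli arms with unknown mean rewards (illustration only).
true_means = np.array([0.3, 0.5, 0.7])
n_arms, horizon = len(true_means), 5000

counts = np.zeros(n_arms)      # number of times each arm has been pulled
estimates = np.zeros(n_arms)   # empirical mean reward of each arm
total_reward = 0.0

for t in range(1, horizon + 1):
    if t <= n_arms:
        arm = t - 1  # pull each arm once to initialize its estimate
    else:
        # UCB1 index: empirical mean plus an exploration bonus.
        ucb = estimates + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
    total_reward += reward

regret = horizon * true_means.max() - total_reward
print(f"empirical regret over {horizon} rounds: {regret:.1f}")
```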