《激励探索的价格:通过汤普森抽样和抽样复杂度的特征》 (The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity)

We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents, and the algorithm can only issue recommendations. The algorithm controls the flow of information, and the information asymmetry can incentivize the agents to explore. Prior work achieves optimal regret rates up to multiplicative factors that become arbitrarily large depending on the Bayesian priors, and scale exponentially in the number of arms. A more basic problem of sampling each arm once runs into similar factors. We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility. We prove that Thompson Sampling, a standard bandit algorithm, is incentive-compatible if initialized with sufficiently many data points. The performance loss due to incentives is therefore limited to the initial rounds when these data points are collected. The problem is largely reduced to that of sample complexity: how many rounds are needed? We address this question, providing matching upper and lower bounds and instantiating them in various corollaries. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the "strength of beliefs".

翻译：我们考虑有激励性的探索:多武装强盗的版本,其中武器的选择由自利的代理人控制,而算法只能发布建议。算法控制信息的流动,信息不对称可以激励代理人探索。先前的工作达到最佳的遗憾率,最多可达到因巴伊西亚前科而任意扩大的倍增性因素,且武器数量成倍增加。每个手臂取样的更基本问题曾经有类似的因素。我们注重奖励的代价:为奖励兼容性而广泛解释的性能损失。我们证明,标准强势算法Thompson 抽样法如果以足够多的数据点初始化,则具有激励兼容性。因此,由于奖励而导致的业绩损失限于收集这些数据点时的最初几轮。问题基本上减少到抽样复杂性:需要多少轮?我们讨论这一问题,提供上下几轮的尺寸,并在各种滚动器库中立即进行。一般情况下,最佳的抽样复杂度是武器数量上的多元性和“指数性”。