Consider a bandit algorithm that recommends actions to self-interested users in a recommendation system. The users are free to choose other actions and need to be incentivized to follow the algorithm's recommendations. While the users prefer to exploit, the algorithm can incentivize them to explore by leveraging the information collected from previous users. All published work on this problem, known as incentivized exploration, focuses on small, unstructured action sets and mainly targets the case in which users' beliefs are independent across actions. However, realistic exploration problems often feature large, structured action sets and highly correlated beliefs. We focus on a paradigmatic exploration problem with structure: combinatorial semi-bandits. We prove that Thompson Sampling, when applied to combinatorial semi-bandits, is incentive-compatible when initialized with a sufficient number of samples of each arm (a number determined in advance by the Bayesian prior). Moreover, we design incentive-compatible algorithms for collecting the initial samples.
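To make the setting concrete, here is a minimal sketch (not the paper's construction) of Thompson Sampling on a toy combinatorial semi-bandit: the action set is "any K of n arms", arm rewards are Bernoulli with independent Beta priors, and the algorithm is warm-started with a fixed number of samples per arm. All names and parameters below (n_arms, K, n_init, horizon, true_means) are illustrative assumptions; in the paper, the required number of initial samples is derived from the Bayesian prior rather than fixed by hand.

```python
# Illustrative sketch only: Beta-Bernoulli Thompson Sampling for a toy
# combinatorial semi-bandit whose actions are all size-K subsets of n arms.
import numpy as np

rng = np.random.default_rng(0)

n_arms, K, n_init, horizon = 10, 3, 5, 1000          # hypothetical parameters
true_means = rng.uniform(0.2, 0.8, size=n_arms)       # unknown to the learner

# Beta(1, 1) priors on each arm; alpha counts successes, beta counts failures.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

# Warm start: collect n_init samples of each arm before making recommendations.
# (The paper determines this number from the prior; here it is a fixed constant.)
for arm in range(n_arms):
    for _ in range(n_init):
        r = rng.random() < true_means[arm]
        alpha[arm] += r
        beta[arm] += 1 - r

for t in range(horizon):
    # Sample a mean for every arm from its posterior...
    theta = rng.beta(alpha, beta)
    # ...and solve the combinatorial subproblem: here, pick the top-K arms.
    action = np.argpartition(theta, -K)[-K:]
    # Semi-bandit feedback: observe the reward of every arm in the chosen set.
    rewards = rng.random(K) < true_means[action]
    alpha[action] += rewards
    beta[action] += 1 - rewards
```

Semi-bandit feedback means the reward of every arm in the chosen set is observed, so each round updates K posteriors at once; this per-arm feedback is what the correlated-beliefs analysis in the abstract relies on.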