Motivated by a number of real-world applications from domains like healthcare and sustainable transportation, in this paper we study a scenario of repeated principal-agent games within a multi-armed bandit (MAB) framework, where: the principal gives a different incentive for each bandit arm, the agent picks a bandit arm to maximize its own expected reward plus incentive, and the principal observes which arm is chosen and receives a reward (different from the agent's) for the chosen arm. Designing policies for the principal is challenging because the principal cannot directly observe the reward that the agent receives for its chosen actions, and so the principal cannot directly learn the agent's expected rewards using existing estimation techniques. As a result, the problem of designing policies for this scenario, as well as similar ones, remains mostly unexplored. In this paper, we construct a policy that achieves low regret (i.e., square-root regret up to a log factor) in this scenario for the case where the agent has perfect knowledge of its own expected rewards for each bandit arm. We design our policy by first constructing an estimator for the agent's expected reward for each bandit arm. Since our estimator uses as data the sequence of offered incentives and subsequently chosen arms, the principal's estimation can be regarded as an analogue of online inverse optimization in MABs. Next, we construct a policy that we prove achieves low regret by deriving finite-sample concentration bounds for our estimator. We conclude with numerical simulations demonstrating the applicability of our policy to a real-life setting from collaborative transportation planning.
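To make the interaction protocol concrete, the following is a minimal sketch (not the paper's implementation) of a single round of the repeated principal-agent bandit game: the principal posts a per-arm incentive vector, the perfect-knowledge agent best-responds by choosing the arm maximizing its own expected reward plus the incentive, and the principal observes only the chosen arm together with its own noisy reward. All names, distributions, and parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                       # number of bandit arms (assumed)
agent_mean = rng.uniform(0.0, 1.0, K)       # agent's expected rewards (known only to the agent)
principal_mean = rng.uniform(0.0, 1.0, K)   # principal's expected rewards (unknown to the agent)

def play_round(incentives):
    """One round of the principal-agent bandit interaction (illustrative sketch)."""
    # Agent picks the arm maximizing its own expected reward plus the offered incentive.
    arm = int(np.argmax(agent_mean + incentives))
    # Principal observes only the chosen arm and its own (noisy) reward for that arm;
    # the agent's realized reward is never revealed to the principal.
    principal_reward = principal_mean[arm] + rng.normal(0.0, 0.1)
    return arm, principal_reward

# Example round: the principal offers a uniform incentive of 0.2 on every arm.
arm, reward = play_round(np.full(K, 0.2))
print(f"agent chose arm {arm}, principal received reward {reward:.3f}")
```

The principal's estimation problem in the paper amounts to inferring `agent_mean` from the logged sequence of `(incentives, arm)` pairs produced by rounds like this one, without ever observing the agent's rewards.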