We study how a principal can efficiently and effectively intervene on the rewards of a previously unseen learning agent in order to induce desirable outcomes. This is relevant to many real-world settings, such as auctions or taxation, where the principal may know neither the learning behavior nor the rewards of real people. Moreover, because interventions are often costly, the principal should be few-shot adaptable and minimize the number of interventions. We introduce MERMAIDE, a model-based meta-learning framework for training a principal that can quickly adapt to out-of-distribution agents with different learning strategies and reward functions. We validate this approach step by step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective at intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both $0$-shot and $K=1$-shot settings with partial agent information.
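To make the principal-agent setup concrete, the following is a minimal sketch (not the MERMAIDE implementation) of a principal intervening on the rewards of an epsilon-greedy bandit agent to steer it toward a target arm under a limited intervention budget; the arm payoffs, epsilon, bonus size, and budget below are illustrative assumptions.

```python
import numpy as np

# Illustrative setup: a 3-armed bandit agent with an epsilon-greedy explore-exploit strategy.
rng = np.random.default_rng(0)
n_arms, horizon, epsilon = 3, 500, 0.1
base_rewards = np.array([0.7, 0.5, 0.3])   # agent's own expected payoffs per arm
target_arm = 2                              # outcome the principal wants to induce
intervention = 0.5                          # bonus the principal adds to the target arm's reward
budget = 50                                 # maximum number of costly interventions

q_values = np.zeros(n_arms)                 # agent's running reward estimates
counts = np.zeros(n_arms)
interventions_used = 0

for t in range(horizon):
    # Agent: epsilon-greedy arm choice.
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(q_values))

    reward = base_rewards[arm] + 0.1 * rng.standard_normal()

    # Principal: intervene on the observed reward while budget remains.
    if arm == target_arm and interventions_used < budget:
        reward += intervention
        interventions_used += 1

    # Agent updates its estimate from the (possibly modified) reward.
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]

print(f"target arm pulled {int(counts[target_arm])}/{horizon} times, "
      f"interventions used: {interventions_used}")
```

In the paper's setting, the fixed intervention rule above would instead be produced by the meta-learned principal, which must adapt to agents whose explore-exploit strategies and reward functions were not seen during training.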