Dynamic mechanism design has garnered significant attention from both computer scientists and economists in recent years. By allowing agents to interact with the seller over multiple rounds, where the agents' reward functions may change with time and are state-dependent, the framework can model a rich class of real-world problems. In these works, the interaction between the agents and the seller is often assumed to follow a Markov Decision Process (MDP). We focus on the setting where the reward and transition functions of such an MDP are not known a priori, and we attempt to recover the optimal mechanism from an a priori collected dataset. In the setting where function approximation is employed to handle large state spaces, and with only mild assumptions on the expressiveness of the function class, we are able to design a dynamic mechanism using offline reinforcement learning algorithms. Moreover, the learned mechanism approximately satisfies three key desiderata: efficiency, individual rationality, and truthfulness. Our algorithm is based on the pessimism principle and requires only a mild assumption on the coverage of the offline dataset. To the best of our knowledge, our work provides the first offline RL algorithm for dynamic mechanism design without assuming uniform coverage.
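To make the pessimism principle concrete, the following is a minimal, illustrative sketch (not the paper's algorithm) of pessimistic value iteration on a tabular finite-horizon MDP: empirical reward and transition estimates built from the offline dataset are penalized by a count-based bonus, so the learned policy avoids state-action pairs that the data covers poorly. The function name, the dataset format, and the specific form of the penalty are assumptions made for illustration.

```python
import numpy as np

def pessimistic_value_iteration(dataset, n_states, n_actions, horizon, beta=1.0):
    """Illustrative pessimistic value iteration on a tabular finite-horizon MDP.

    dataset: list of (h, s, a, r, s_next) transitions collected offline.
    beta: scale of the count-based pessimism penalty (hypothetical choice).
    """
    counts = np.zeros((horizon, n_states, n_actions))
    reward_sum = np.zeros((horizon, n_states, n_actions))
    next_counts = np.zeros((horizon, n_states, n_actions, n_states))
    for h, s, a, r, s_next in dataset:
        counts[h, s, a] += 1
        reward_sum[h, s, a] += r
        next_counts[h, s, a, s_next] += 1

    V = np.zeros((horizon + 1, n_states))
    policy = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        n = np.maximum(counts[h], 1)              # avoid division by zero
        r_hat = reward_sum[h] / n                 # empirical rewards
        p_hat = next_counts[h] / n[..., None]     # empirical transition probabilities
        bonus = beta / np.sqrt(n)                 # larger penalty where data is scarce
        Q = r_hat + p_hat @ V[h + 1] - bonus      # pessimistic Q-estimate
        Q = np.clip(Q, 0.0, horizon - h)          # keep values in a valid range
        policy[h] = Q.argmax(axis=1)              # greedy policy w.r.t. pessimistic Q
        V[h] = Q.max(axis=1)
    return policy, V
```

Subtracting the bonus yields a lower confidence bound on the value, so the returned policy is evaluated conservatively on poorly covered regions of the state-action space; this is the sense in which pessimism replaces the uniform-coverage assumption with a much milder coverage requirement.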