A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets

Two-sided markets, such as ride-sharing platforms, often involve a group of subjects who make sequential decisions across time and/or location. With the rapid development of smartphones and the Internet of Things, these markets have substantially transformed the transportation landscape. In this paper, we consider large-scale fleet management in ride-sharing companies, which involves multiple units in different areas receiving sequences of products (or treatments) over time. Policy evaluation in these studies poses major technical challenges because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying out policy evaluation in these studies. We propose novel estimators for the mean outcomes under different products that are consistent despite the high dimensionality of the state-action space. The proposed estimators perform favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of our proposed method is available at https://github.com/RunzheStat/CausalMARL.
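To make challenge (ii) concrete, here is a minimal sketch; it is not the authors' estimator. It assumes each location receives a binary action (for example, subsidy on or off) and, purely for illustration, a hypothetical local summary in which a location tracks only its own action and the number of treated neighbours. The function names are invented for this example.

```python
from itertools import product


def joint_actions(num_regions):
    """Enumerate every joint assignment of a binary action to each region."""
    return list(product((0, 1), repeat=num_regions))


def local_summaries(num_neighbors):
    """Enumerate (own action, number of treated neighbours) pairs for one region.

    This local summary is an illustrative assumption, not the paper's method.
    """
    return [(own, k) for own in (0, 1) for k in range(num_neighbors + 1)]


# The joint action space doubles with every added region, while the
# per-region summary depends only on the (fixed) number of neighbours.
for n in (2, 4, 10):
    print(n, len(joint_actions(n)), len(local_summaries(4)))
```

With N regions the joint action space has 2^N elements, so an evaluation method that treats the whole market as a single agent quickly becomes infeasible as N grows; any per-location representation of this kind stays small regardless of N.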
