In this study, a real-time dispatching algorithm based on reinforcement learning is proposed and, for the first time, deployed at large scale. Current dispatching methods on ride-hailing platforms are predominantly myopic or rule-based non-myopic approaches. Reinforcement learning enables dispatching policies that are informed by historical data and can exploit the learned information to optimize the expected return of future trajectories. Previous studies in this field yielded promising results, yet left room for improvement in performance gain, self-dependency, transferability, and scalable deployment mechanisms. The present study proposes a standalone RL-based dispatching solution equipped with multiple mechanisms to ensure robust and efficient on-policy learning and inference while remaining adaptable for full-scale deployment. A new form of value updating based on temporal difference is proposed that is better adapted to the inherent uncertainty of the problem. For driver-order assignment, a customized utility function is proposed that, when tuned to the statistics of the market, yields remarkable improvements in performance and interpretability. In addition, to reduce the risk of cancellation after driver assignment, an adaptive graph-pruning strategy based on the multi-armed bandit problem is introduced. The method is evaluated in offline simulation with real data and yields notable performance improvement. The algorithm is also deployed online for A/B testing in multiple cities under DiDi's operation and has been launched in one of the major international markets as the primary mode of dispatch. The deployed algorithm shows over 1.3% improvement in total driver income in A/B testing. Furthermore, causal inference analysis detects as much as 5.3% improvement in major performance metrics after full-scale deployment.
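To make the temporal-difference component concrete, the following is a minimal illustrative sketch of a standard tabular TD(0) backup for a spatiotemporal value function over city grid cells — not the paper's modified update rule. The cell identifiers, learning rate, and per-step discounting over variable trip durations are all assumptions for illustration.

```python
def td0_update(V, s, s_next, reward, duration, alpha=0.05, gamma=0.95):
    """One TD(0) backup for a completed trip from cell s to cell s_next.

    The trip spans `duration` time steps, so the discount is applied
    per step (gamma ** duration), a common choice in dispatch problems
    where trips have variable length.
    """
    target = reward + (gamma ** duration) * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V

# Hypothetical example: an 8.0-unit fare from cell_12 to cell_34 over 3 steps.
V = td0_update({}, s="cell_12", s_next="cell_34", reward=8.0, duration=3)
```

The per-step exponent on the discount factor keeps long trips from being over-credited relative to short ones, which matters when trip lengths vary widely across a city.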
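The driver-order assignment step can be viewed as a maximum-weight bipartite matching over a utility matrix. The sketch below uses a hypothetical utility form (immediate fare plus a discounted value-function advantage, minus a pickup-distance penalty) and brute-force matching on a tiny instance; it is not the paper's tuned utility function or its production solver.

```python
from itertools import permutations

def utility(fare, v_dest, v_origin, pickup_km, gamma=0.95, dist_weight=0.5):
    # Illustrative utility: immediate reward + value advantage - pickup cost.
    return fare + gamma * v_dest - v_origin - dist_weight * pickup_km

def best_assignment(U):
    """Brute-force max-weight matching for a small square utility matrix U,
    where U[i][j] is the utility of assigning driver i to order j."""
    n = len(U)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(U[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_perm, best_score

# Hypothetical 2-driver, 2-order instance.
U = [[4.0, 1.0],
     [2.0, 3.0]]
perm, score = best_assignment(U)
```

In production such matchings are solved with polynomial-time algorithms (e.g. the Hungarian method) rather than enumeration; the brute force here only serves to define the objective.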
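For the adaptive graph pruning, a generic multi-armed bandit sketch is shown below, with UCB1 selecting among candidate pruning radii for the driver-order graph. The cutoff values, the completion probabilities, and the simulated reward signal are all hypothetical; the paper's actual arm definition and reward design are not reproduced here.

```python
import math
import random

class UCB1:
    """Standard UCB1 bandit over a discrete set of arms."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}
        self.total = 0

    def select(self):
        # Play each arm once before applying the UCB rule.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        return max(self.arms, key=lambda a: self.values[a]
                   + math.sqrt(2.0 * math.log(self.total) / self.counts[a]))

    def update(self, arm, reward):
        self.total += 1
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

random.seed(0)
cutoffs_km = [1.0, 2.0, 3.0]                      # hypothetical pruning radii
completion_prob = {1.0: 0.9, 2.0: 0.5, 3.0: 0.3}  # simulated, not real data
bandit = UCB1(cutoffs_km)
for _ in range(2000):
    arm = bandit.select()
    # Reward 1 if the dispatched order completes (no cancellation), else 0.
    bandit.update(arm, 1.0 if random.random() < completion_prob[arm] else 0.0)
```

Under this simulation the bandit concentrates its pulls on the radius with the highest completion rate, which is the behavior the pruning strategy relies on: aggressive cutoffs that provoke cancellations are explored but eventually avoided.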