A Deep Value-network Based Approach for Multi-Driver Order Dispatching

Recent works on ride-sharing order dispatching have highlighted the importance of taking into account both the spatial and temporal dynamics in the dispatching process for improving the transportation system efficiency. At the same time, deep reinforcement learning has advanced to the point where it achieves superhuman performance in a number of fields. In this work, we propose a deep reinforcement learning based solution for order dispatching, and we conduct large-scale online A/B tests on DiDi's ride-dispatching platform to show that the proposed method achieves significant improvement on both total driver income and user-experience-related metrics. In particular, we model the ride dispatching problem as a Semi-Markov Decision Process to account for the temporal aspect of the dispatching actions. To improve the stability of value iteration with nonlinear function approximators such as neural networks, we propose Cerebellar Value Networks (CVNet) with a novel distributed state representation layer. We further derive a regularized policy evaluation scheme for CVNet that penalizes a large Lipschitz constant of the value network for additional robustness against adversarial perturbation and noise. Finally, we adapt various transfer learning methods to CVNet for increased learning adaptability and efficiency across multiple cities. We conduct extensive offline simulations based on real dispatching data as well as online A/B tests through DiDi's platform. Results show that CVNet consistently outperforms other recently proposed dispatching methods. We finally show that the performance can be further improved through the efficient use of transfer learning.
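To make the Semi-Markov view concrete, the following is a minimal sketch of a one-step value target for a temporally extended action such as a trip that lasts several time steps. The abstract does not state the update rule, so this is the generic SMDP form rather than the authors' exact scheme. The function name `smdp_td_target`, its parameters, the default discount factor, and the assumption that the fare accrues uniformly over the trip are all illustrative choices.

```python
def smdp_td_target(reward, duration, next_value, gamma=0.9):
    """Bootstrapped value target for an action that spans `duration` steps.

    reward:     total reward of the trip (hypothetical: the fare), assumed
                here to accrue uniformly across the trip's time steps.
    duration:   number of time steps the trip takes (must be >= 1).
    next_value: current value estimate of the state where the trip ends.
    gamma:      per-step discount factor (illustrative default).
    """
    if duration < 1:
        raise ValueError("duration must be at least 1")

    # Spread the reward evenly and discount each slice by its delay.
    per_step = reward / duration
    discounted_reward = sum(per_step * gamma**t for t in range(duration))

    # The successor value is discounted by the full trip duration,
    # which is what distinguishes an SMDP target from a plain MDP one.
    return discounted_reward + gamma**duration * next_value


if __name__ == "__main__":
    # A short trip and a long trip with the same fare and destination value.
    print(smdp_td_target(10.0, 1, 5.0))
    print(smdp_td_target(10.0, 4, 5.0))
```

Under these assumptions, a longer trip with the same fare and destination value yields a smaller target, because both the reward and the successor value are discounted over more steps. This is the temporal aspect of dispatching that a standard one-step MDP target would ignore.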
