A Deep Reinforcement Learning Approach for Constrained Online Logistics Route Assignment

As online shopping and e-commerce platforms grow, a tremendous number of parcels are transported every day. Assigning each parcel to one of its candidate logistics routes is therefore a crucial decision for the logistics industry, because it strongly affects both the total logistics cost and the satisfaction of business constraints such as transit hub capacity and the proportion of deliveries allocated to each delivery provider. This online route-assignment problem can be viewed as a constrained online decision-making problem. Notably, the large number of daily parcels (beyond $10^5$), together with the variability and non-Markovian characteristics of parcel information, makes it difficult to attain a (near-)optimal solution without violating constraints excessively. In this paper, we develop a model-free deep reinforcement learning (DRL) approach named PPO-RA, in which Proximal Policy Optimization (PPO) is improved with dedicated techniques to address the challenges of route assignment (RA). The actor and critic networks use an attention mechanism and parameter sharing to accommodate incoming parcels whose candidate routes vary in number and identity. Because we assume i.i.d. parcel arrivals, the networks do not model the non-Markovian parcel arrival dynamics. We evaluate PPO-RA on recorded delivery parcel data by comparing it with widely used baselines in simulation. The results show that the proposed approach achieves considerable cost savings while satisfying most constraints.
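To illustrate how an attention mechanism with shared parameters can handle a varying number of candidate routes per parcel, the following is a minimal NumPy sketch. It is not the PPO-RA architecture: the class name `SharedRouteScorer`, the feature dimensions, and the single dot-product attention layer are all assumptions made for illustration. The point it shows is that one set of weights scores every route, so the same model works for 3 routes or 5, and the output does not depend on the order in which routes are listed.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


class SharedRouteScorer:
    """Hypothetical actor head (illustrative only, not the paper's network).

    A parcel embedding acts as the attention query and each candidate
    route embedding acts as a key. The projection matrices are shared
    across all routes, so the number of candidates may differ per parcel.
    """

    def __init__(self, parcel_dim, route_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        # Shared parameters: their shapes do not depend on the route count.
        self.W_q = rng.normal(scale=0.1, size=(parcel_dim, hidden_dim))
        self.W_k = rng.normal(scale=0.1, size=(route_dim, hidden_dim))

    def route_probabilities(self, parcel, routes):
        """Return a probability for each candidate route.

        parcel: array of shape (parcel_dim,)
        routes: array of shape (n_routes, route_dim), n_routes may vary
        """
        query = parcel @ self.W_q            # shape (hidden_dim,)
        keys = routes @ self.W_k             # shape (n_routes, hidden_dim)
        logits = keys @ query / np.sqrt(self.hidden_dim)
        return softmax(logits)


# Usage: the same scorer handles parcels with different candidate counts.
scorer = SharedRouteScorer(parcel_dim=4, route_dim=6, hidden_dim=8)
rng = np.random.default_rng(1)
parcel = rng.normal(size=4)
probs_3 = scorer.route_probabilities(parcel, rng.normal(size=(3, 6)))
probs_5 = scorer.route_probabilities(parcel, rng.normal(size=(5, 6)))
```

In a PPO-style setup, a distribution like `probs_3` would be the policy from which a route is sampled; how the paper's networks encode constraints such as hub capacity is not specified in the abstract and is left out of this sketch.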
