In this paper, we study a sequential decision-making problem faced by e-commerce carriers: when to dispatch a vehicle from the central depot to serve customer requests, and in which order to provide the service, under the assumption that the times at which parcels arrive at the depot are stochastic and dynamic. The objective is to maximize the number of parcels delivered during the service hours. We propose two reinforcement learning approaches for solving this problem, one based on a policy function approximation (PFA) and the other on a value function approximation (VFA). Both methods are combined with a look-ahead strategy, in which future release dates are sampled in a Monte Carlo fashion and a tailored batch approach is used to approximate the value of future states. Our PFA and VFA make good use of branch-and-cut-based exact methods to improve the quality of decisions. We also establish sufficient conditions for a partial characterization of the optimal policy and integrate them into the PFA and VFA. In an empirical study based on 720 benchmark instances, we conduct a competitive analysis using upper bounds with perfect information and show that the PFA and VFA greatly outperform two alternative myopic approaches. Overall, the PFA provides the best solutions, while the VFA (which benefits from a two-stage stochastic optimization model) achieves a better trade-off between solution quality and computing time.