We study a stochastic variant of the vehicle routing problem arising in the context of domestic donor collection services. The problem we consider combines the following attributes. Customers requesting service are variable, in the sense that they are stochastic and not restricted to a predefined set: they may appear anywhere in a given service area. Furthermore, demand volumes are stochastic and are only observed upon visiting a customer. The objective is to maximize the expected amount of served demand while meeting vehicle capacity and time restrictions. We call this problem the VRP with a highly Variable Customer basis and Stochastic Demands (VRP-VCSD). For this problem, we first propose a Markov Decision Process (MDP) formulation representing the classical centralized decision-making perspective, in which a single decision-maker establishes the routes of all vehicles. While the resulting formulation is intractable, it provides the foundation for a new MDP formulation, which we call partially decentralized. In this formulation, the action space is decomposed by vehicle. The decentralization is nonetheless incomplete, as we enforce identical vehicle-specific policies while optimizing the collective reward. We propose several strategies to reduce the dimension of the state and action spaces associated with the partially decentralized formulation. These yield a considerably more tractable problem, which we solve via Reinforcement Learning. In particular, we develop a Q-learning algorithm called DecQN that features state-of-the-art acceleration techniques. We conduct a thorough computational analysis. The results show that DecQN considerably outperforms three benchmark policies. Moreover, we show that our approach can compete with specialized methods developed for the particular case of the VRP-VCSD in which customer locations and expected demands are known in advance.
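To make the partially decentralized idea concrete, the following is a minimal tabular Q-learning sketch in which every vehicle selects actions from one shared Q-function (i.e., identical vehicle-specific policies) and every update is driven by the fleet's collective reward. The tabular setting and all names (`choose_action`, `update`, the transition tuples) are illustrative assumptions for exposition only; they are not the paper's DecQN implementation, which additionally relies on state- and action-space reduction and acceleration techniques.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

# One Q-table shared by all vehicles: keys are (vehicle_state, action).
# Sharing the table is what enforces identical vehicle-specific policies.
Q = defaultdict(float)

def choose_action(vehicle_state, feasible_actions):
    """Epsilon-greedy action for a single vehicle from the shared Q-table."""
    if random.random() < EPSILON:
        return random.choice(feasible_actions)
    return max(feasible_actions, key=lambda a: Q[(vehicle_state, a)])

def update(transitions, collective_reward):
    """Apply one Q-learning update per vehicle transition.

    transitions: list of (state, action, next_state, next_feasible_actions),
                 one tuple per vehicle.
    collective_reward: reward earned jointly by the whole fleet this step.
    """
    for s, a, s_next, feas_next in transitions:
        best_next = max((Q[(s_next, a2)] for a2 in feas_next), default=0.0)
        target = collective_reward + GAMMA * best_next
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

Because all vehicles read from and write to the same table, each vehicle's experience improves the common policy, which is the essential contrast with a fully centralized formulation whose joint action space grows combinatorially in the number of vehicles.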