We derive a learning framework to generate routing/pickup policies for a fleet of vehicles tasked with servicing stochastically appearing requests on a city map. We focus on policies that 1) give rise to coordination amongst the vehicles, thereby reducing wait times for servicing requests, 2) are non-myopic, considering a priori unknown potential future requests, and 3) can adapt to changes in the underlying demand distribution. Specifically, we are interested in adapting to fluctuations of actual demand conditions in urban environments, such as on-peak vs. off-peak hours. We achieve this through a combination of (i) online play, a lookahead optimization method that improves the performance of rollout methods via an approximate policy iteration step, and (ii) an offline approximation scheme that allows for adapting to changes in the underlying demand model. In particular, we achieve adaptivity of our learned policy to different demand distributions by quantifying a region of validity using the q-valid radius of a Wasserstein ambiguity set. We propose a mechanism for switching the originally trained offline approximation when the current demand falls outside the original validity region; in this case, we switch to an offline architecture trained on a historical demand model that is closer to the current demand in terms of Wasserstein distance. We learn routing and pickup policies over real taxicab requests in downtown San Francisco with high variability between on-peak and off-peak hours, demonstrating the ability of our method to adapt to real fluctuations in demand distributions. Our numerical results demonstrate that our method outperforms rollout-based reinforcement learning, as well as several benchmarks based on classical methods from the field of operations research.
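To make the switching mechanism concrete, the following is a minimal sketch (not from the paper), assuming one-dimensional demand samples and using SciPy's 1-D Wasserstein distance as a stand-in for the paper's distance between demand distributions over the city map; the function name `select_offline_model`, the `models` record structure, and the `q_valid_radius` threshold are illustrative placeholders.

```python
from scipy.stats import wasserstein_distance

def select_offline_model(current_samples, models, q_valid_radius):
    """Keep the originally trained offline approximation while the current
    demand stays inside its validity region; otherwise switch to the
    historical model closest to the current demand in Wasserstein distance.

    `models[0]` is assumed to be the originally trained model; each entry
    carries the demand samples it was trained on under "train_samples".
    """
    # Distance from current demand to the original model's training demand.
    d0 = wasserstein_distance(current_samples, models[0]["train_samples"])
    if d0 <= q_valid_radius:
        # Current demand is still within the q-valid region: no switch.
        return models[0]
    # Outside the validity region: pick the closest historical demand model.
    return min(models,
               key=lambda m: wasserstein_distance(current_samples,
                                                  m["train_samples"]))
```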