This paper introduces an adaptive, model-free deep reinforcement learning approach that can recognize and adapt to diurnal patterns in the ride-sharing environment with car-pooling. Deep Reinforcement Learning (RL) suffers from catastrophic forgetting because it is agnostic to the timescale of changes in the distribution of experiences. Although RL algorithms are guaranteed to converge to optimal policies in Markov decision processes (MDPs), this guarantee holds only for static environments. However, this assumption is very restrictive. Many real-world problems, such as ride-sharing and traffic control, involve highly dynamic environments, where RL methods yield only sub-optimal decisions. To mitigate this problem in highly dynamic environments, we (1) adopt an online Dirichlet change point detection (ODCP) algorithm to detect changes in the distribution of experiences, and (2) develop a Deep Q Network (DQN) agent that is capable of recognizing diurnal patterns and making informed dispatching decisions according to the changes in the underlying environment. Rather than fixing patterns by time of week, the proposed approach automatically detects that the MDP has changed and switches to the model matching the new environment. In addition to the adaptation logic in dispatching, this paper also proposes a demand-aware vehicle-passenger matching and route planning framework that dynamically generates optimal routes for each vehicle based on online demand, vehicle capacities, and locations. Evaluation on the public New York City Taxi dataset shows the effectiveness of our approach in improving fleet utilization, where less than 50% of the fleet is utilized to serve up to 90% of the requests, while maximizing profits and minimizing idle times.
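The core control loop described above — detect a distribution shift in the experience stream, then dispatch with a model matched to the newly detected regime — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the sliding-window detector below is a toy stand-in for the actual ODCP algorithm, and `SlidingWindowChangeDetector`, `TinyDQN`, `RegimeAwareDispatcher`, and all parameter values are hypothetical names introduced for this sketch.

```python
import collections
import random

import numpy as np


class SlidingWindowChangeDetector:
    """Toy stand-in for the ODCP detector: flags a change point when the
    empirical demand distributions of two adjacent windows drift apart
    (measured by total-variation distance). The real ODCP algorithm
    performs online Dirichlet-based inference instead."""

    def __init__(self, window=200, n_bins=24, threshold=0.3):
        self.window = window
        self.n_bins = n_bins
        self.threshold = threshold
        self.buffer = collections.deque(maxlen=2 * window)

    def update(self, observation_bin):
        """Feed one discretized observation; return True on a detected change."""
        self.buffer.append(observation_bin)
        if len(self.buffer) < 2 * self.window:
            return False
        data = list(self.buffer)
        old = np.bincount(data[:self.window], minlength=self.n_bins) / self.window
        new = np.bincount(data[self.window:], minlength=self.n_bins) / self.window
        if 0.5 * np.abs(old - new).sum() > self.threshold:
            self.buffer.clear()  # restart statistics for the new regime
            return True
        return False


class TinyDQN:
    """Placeholder agent; a real implementation would hold a Q-network."""

    def act(self, state):
        return random.randrange(4)  # pretend there are 4 dispatch actions


class RegimeAwareDispatcher:
    """Keeps one DQN per detected regime and routes each dispatching
    decision to the model matching the current environment."""

    def __init__(self, make_dqn):
        self.make_dqn = make_dqn        # factory returning a fresh agent
        self.models = {0: make_dqn()}   # regime id -> agent
        self.current = 0
        self.detector = SlidingWindowChangeDetector()

    def step(self, state, demand_bin):
        if self.detector.update(demand_bin):
            # Change point: spawn (or reuse) a model for the new regime
            # instead of overwriting the old one, so previously learned
            # diurnal patterns are not catastrophically forgotten.
            self.current += 1
            self.models.setdefault(self.current, self.make_dqn())
        return self.models[self.current].act(state)


# Usage on a synthetic diurnal demand signal.
dispatcher = RegimeAwareDispatcher(make_dqn=TinyDQN)
for t in range(2000):
    hour_bin = (t // 60) % 24  # hour-of-day bucket for the incoming request
    action = dispatcher.step(state=None, demand_bin=hour_bin)
```

Keeping a separate model per regime, rather than fine-tuning a single network on the shifting stream, is one simple way to realize the abstract's goal of avoiding catastrophic forgetting; the design choice of when to reuse an earlier regime's model versus spawning a new one is left open in this sketch.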