We consider a sequential decision-making problem in which the agent faces an environment characterized by stochastic discrete events and seeks an optimal intervention policy that maximizes its long-term reward. This problem arises ubiquitously in social media, finance, and health informatics, yet has rarely been investigated in conventional reinforcement learning research. To this end, we present a novel model-based reinforcement learning framework in which the agent's actions and observations are asynchronous stochastic discrete events occurring in continuous time. We model the dynamics of the environment with a Hawkes process augmented by an external intervention control term, and develop an algorithm that embeds this process in the Bellman equation, which guides the direction of the value gradient. We demonstrate the superiority of our method on both a synthetic simulator and a real-world problem.
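For concreteness, one standard way to write such an intervention-controlled Hawkes conditional intensity (a sketch in illustrative notation; the symbols below are assumptions, not the paper's own) is

\lambda(t) = \mu + \sum_{t_i < t} \kappa(t - t_i) + u(t),

where \mu is the baseline event rate, \kappa is the self-excitation kernel (e.g., \kappa(s) = \alpha e^{-\beta s}) summed over past event times t_i, and u(t) is the external intervention control term set by the agent's policy.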