In reinforcement learning (RL), rewards are commonly discounted over time with an exponential function to model time preference and to keep the expected long-term return bounded. In contrast, work in economics and psychology has shown that humans often adopt a hyperbolic discounting scheme, which is optimal under a particular assumption about the distribution of the task termination time. In this work, we propose a theory of continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers, in particular, the case of a non-exponentially distributed random termination time. We derive a Hamilton-Jacobi-Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved with a collocation method that employs deep learning for function approximation. Further, we show how to approach the inverse RL problem, in which one seeks to recover properties of the discount function from decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.
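To make the link between discounting and termination times concrete, here is a minimal sketch of the standard argument; the symbols $\lambda$, $\alpha$, and $\beta$ are illustrative and not taken from our formulation. If the task terminates at a random time $T$ with a constant, known hazard rate $\lambda$, the survival probability $\Pr(T > t) = e^{-\lambda t}$ recovers exponential discounting. If instead $\lambda$ is uncertain with a $\mathrm{Gamma}(\alpha, \beta)$ prior, the effective discount function becomes
$$
d(t) \;=\; \mathbb{E}_{\lambda}\!\left[e^{-\lambda t}\right]
\;=\; \int_0^{\infty} e^{-\lambda t}\,\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1} e^{-\beta \lambda}\,\mathrm{d}\lambda
\;=\; \left(\frac{\beta}{\beta + t}\right)^{\!\alpha},
$$
which for $\alpha = 1$ reduces to the classic hyperbolic form $d(t) = 1/(1 + t/\beta)$.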