Many real-world offline reinforcement learning (RL) problems involve continuous-time environments with delays. Such environments are characterized by two distinctive features: first, the state x(t) is observed at irregular time intervals; second, the current action a(t) only affects the future state x(t + g) with an unknown delay g > 0. A prime example of such an environment is satellite control, where the communication link between Earth and a satellite causes irregular observations and delays. Existing offline RL algorithms have achieved success in environments with states observed at irregular times or with known delays. However, handling environments that involve both irregular observation times and unknown delays remains an open and challenging problem. To this end, we propose Neural Laplace Control, a continuous-time model-based offline RL method that combines a Neural Laplace dynamics model with a model predictive control (MPC) planner, and is able to learn from an offline dataset sampled at irregular time intervals from an environment with an inherent unknown constant delay. We show experimentally on continuous-time delayed environments that it is able to achieve near-expert policy performance.
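To make the model-plus-planner idea concrete, the following is a minimal sketch (not the authors' implementation) of a random-shooting MPC planner on top of a generic learned continuous-time dynamics model that can be queried at arbitrary time offsets. The `dynamics` and `cost` callables, the function name `mpc_plan`, and all parameter choices are hypothetical placeholders standing in for the Neural Laplace dynamics model and the planner described above; handling of the unknown delay g (e.g., conditioning on recent actions) is omitted.

```python
# Hedged sketch: random-shooting MPC over a learned continuous-time dynamics model.
# dynamics(x, a, dt) -> next state, queried at a (possibly irregular) time offset dt.
# All names and hyperparameters here are illustrative assumptions, not the paper's code.
import numpy as np

def mpc_plan(dynamics, cost, x0, dt, horizon=10, n_candidates=256, action_dim=1, rng=None):
    """Return the first action of the lowest-cost sampled action sequence."""
    rng = rng or np.random.default_rng()
    # Sample candidate open-loop action sequences uniformly in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    best_cost, best_action = np.inf, np.zeros(action_dim)
    for seq in candidates:
        x, total = np.array(x0, dtype=float), 0.0
        for a in seq:
            # Roll the learned model forward by the observation interval dt.
            x = dynamics(x, a, dt)
            total += cost(x, a)
        if total < best_cost:
            best_cost, best_action = total, seq[0]
    return best_action

if __name__ == "__main__":
    # Toy stand-ins: linear dynamics and a quadratic cost, just to exercise the planner.
    dynamics = lambda x, a, dt: x + dt * (0.5 * x + a)
    cost = lambda x, a: float(np.sum(x ** 2) + 0.01 * np.sum(a ** 2))
    print(mpc_plan(dynamics, cost, x0=[1.0], dt=0.1))
```

In the actual method, the hypothetical `dynamics` callable would be replaced by the trained Neural Laplace model, which is what allows prediction at the irregular, delayed time offsets seen in the offline data.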