This paper addresses distributional offline continuous-time reinforcement learning (DOCTR-L) with stochastic policies for high-dimensional optimal control. A soft distributional version of the classical Hamilton-Jacobi-Bellman (HJB) equation is given by a semilinear partial differential equation (PDE). This `soft HJB equation' can be learned from offline data without assuming that the latter correspond to a previously used optimal or near-optimal policy. A data-driven solution of the soft HJB equation uses methods of Neural PDEs and Physics-Informed Neural Networks developed in the field of Scientific Machine Learning (SciML). The suggested approach, dubbed `SciPhy RL', thus reduces DOCTR-L to solving neural PDEs from data. Our algorithm, called Deep DOCTR-L, converts offline high-dimensional data into an optimal policy in a single step by reducing the problem to supervised learning, instead of relying on value-iteration or policy-iteration methods. The method enables a computable approach to the quality control of the obtained policies in terms of both their expected returns and the uncertainties about their values.
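To make the PINN-style reduction described above more concrete, the minimal PyTorch sketch below fits a neural value function by penalizing the squared residual of a semilinear HJB-type PDE on offline samples. The exact soft distributional HJB equation of the paper is not reproduced in this abstract, so a generic placeholder residual (drift, diffusion, running reward, and a quadratic "soft" nonlinearity) stands in for it; all names such as ValueNet, hjb_residual, sigma, and lam are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """Neural approximation V_theta(t, x) of the value function."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, t, x):
        # V(t, x): concatenate time and state, map to a scalar value.
        return self.net(torch.cat([t, x], dim=-1))


def grad(outputs, inputs):
    """Per-sample first derivatives of a scalar output w.r.t. inputs."""
    return torch.autograd.grad(outputs, inputs,
                               grad_outputs=torch.ones_like(outputs),
                               create_graph=True)[0]


def hjb_residual(model, t, x, f, r, sigma=0.1, lam=0.1):
    """Pointwise residual of a placeholder semilinear HJB-type PDE:
    dV/dt + f . grad_x V + (sigma^2 / 2) Lap_x V + r - lam * |grad_x V|^2.
    The quadratic term stands in for the 'soft' (entropy-like) nonlinearity;
    it is an assumption, not the paper's exact operator."""
    t = t.clone().requires_grad_(True)
    x = x.clone().requires_grad_(True)
    V = model(t, x)
    V_t = grad(V, t)                                 # dV/dt
    V_x = grad(V, x)                                 # grad_x V
    lap = sum(grad(V_x[:, i:i + 1], x)[:, i:i + 1]   # Lap_x V, one dim at a time
              for i in range(x.shape[1]))
    drift = (f * V_x).sum(dim=1, keepdim=True)
    soft = lam * (V_x ** 2).sum(dim=1, keepdim=True)
    return V_t + drift + 0.5 * sigma ** 2 * lap + r - soft


def pinn_loss(model, batch):
    """Supervised PINN-style loss: squared PDE residual on offline samples
    plus a terminal-condition fit V(T, x_T) = g(x_T)."""
    res = hjb_residual(model, batch["t"], batch["x"], batch["f"], batch["r"])
    term = model(batch["t_T"], batch["x_T"]) - batch["g_T"]
    return (res ** 2).mean() + (term ** 2).mean()


if __name__ == "__main__":
    # Usage sketch: one gradient step on a synthetic batch standing in for
    # offline data (time, state, drift estimate, running reward, terminal data).
    torch.manual_seed(0)
    d, n = 4, 32
    model = ValueNet(d)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    batch = {"t": torch.rand(n, 1), "x": torch.randn(n, d),
             "f": torch.randn(n, d), "r": torch.randn(n, 1),
             "t_T": torch.ones(n, 1), "x_T": torch.randn(n, d),
             "g_T": torch.zeros(n, 1)}
    loss = pinn_loss(model, batch)
    loss.backward()
    opt.step()
    print(f"PINN loss on the batch: {loss.item():.4f}")
```

In this reading, the offline data only need to supply states, rewards, and dynamics information along observed trajectories; no assumption is made that the logged behavior was optimal, which matches the claim in the abstract that the soft HJB equation can be learned directly from arbitrary offline data.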