Diffusion processes that evolve according to linear stochastic differential equations are an important family of continuous-time dynamic decision-making models. Optimal policies for them are well studied when the drift matrices are fully known. However, little is known about data-driven control of diffusion processes with uncertain drift matrices, because conventional discrete-time analysis techniques do not apply. In addition, while the task can be viewed as a reinforcement learning problem involving an exploration-exploitation trade-off, ensuring system stability is a fundamental component of designing optimal policies. We establish that the popular Thompson sampling algorithm learns optimal actions fast, incurring regret that grows only as the square root of time, and also stabilizes the system in a short time period. To the best of our knowledge, this is the first such result for Thompson sampling in a diffusion process control problem. We validate our theoretical results through empirical simulations using real parameter matrices from two settings: airplane control and blood glucose control. Moreover, we observe that Thompson sampling significantly improves (worst-case) regret compared to state-of-the-art algorithms, suggesting that Thompson sampling explores in a more guarded fashion. Our theoretical analysis involves the characterization of a certain optimality manifold that ties the local geometry of the drift parameters to the optimal control of the diffusion process. We expect this technique to be of broader interest.
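To make the setting concrete, the following is a minimal sketch of Thompson sampling on a scalar diffusion dx = (a x + b u) dt + dW. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the quadratic running cost q x^2 + r u^2, the N(0, I) prior on (a, b), the Euler-Maruyama discretization, the doubling episode lengths, the gain clipping, and all function names (`lqr_gain`, `posterior_draw`, `run_thompson_sampling`) are hypothetical choices made for this example.

```python
import math
import random


def lqr_gain(a, b, q=1.0, r=1.0):
    """Feedback gain k (u = -k x) for dx = (a x + b u) dt + dW under the
    assumed cost q x^2 + r u^2, from the scalar Riccati equation
    2 a p - (b p)^2 / r + q = 0."""
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    return b * p / r


def posterior_draw(v11, v12, v22, s1, s2, rng):
    """Draw (a, b) from N(V^{-1} s, V^{-1}) using a Cholesky factor L of V."""
    l11 = math.sqrt(v11)
    l21 = v12 / l11
    l22 = math.sqrt(max(v22 - l21 * l21, 1e-12))  # guard against round-off
    # Posterior mean: solve L w = s, then L^T m = w.
    w1 = s1 / l11
    w2 = (s2 - l21 * w1) / l22
    m2 = w2 / l22
    m1 = (w1 - l21 * m2) / l11
    # Zero-mean perturbation with covariance V^{-1}: solve L^T y = z.
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y2 = z2 / l22
    y1 = (z1 - l21 * y2) / l11
    return m1 + y1, m2 + y2


def run_thompson_sampling(a_true=0.5, b_true=1.0, horizon=10.0, dt=0.01, seed=0):
    rng = random.Random(seed)
    v11, v12, v22 = 1.0, 0.0, 1.0  # posterior precision V, starting at the prior
    s1 = s2 = 0.0                  # running approximation of the integral of z dx
    x, k, gains = 0.0, 0.0, []
    next_update, episode = 0.0, 0.5
    for i in range(int(round(horizon / dt))):
        if i * dt >= next_update:
            # New episode: sample drift parameters and act as if they were true.
            for _ in range(100):
                a_hat, b_hat = posterior_draw(v11, v12, v22, s1, s2, rng)
                if abs(b_hat) > 0.05:  # skip draws with a near-zero input gain
                    k = max(-5.0, min(5.0, lqr_gain(a_hat, b_hat)))
                    break
            gains.append(k)
            next_update += episode
            episode *= 2.0         # assumed schedule: episode lengths double
        u = -k * x
        # Euler-Maruyama step of the true (unknown) dynamics.
        dx = (a_true * x + b_true * u) * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Update the Gaussian posterior with the regressor z = (x, u).
        v11 += x * x * dt
        v12 += x * u * dt
        v22 += u * u * dt
        s1 += x * dx
        s2 += u * dx
        x += dx
    return gains, x


if __name__ == "__main__":
    gains, x_final = run_thompson_sampling()
    print("gains per episode:", gains)
    print("final state:", x_final)
```

The sketch resamples the drift parameters only at episode boundaries and then applies the gain that would be optimal if the sample were the truth. Within an episode the input is proportional to the state, so the data mainly identify the closed-loop drift; it is the change of gain between episodes that separates a from b.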