Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning

This work uses the entropy-regularised relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Here, the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. On the one hand, the noisy policies explore the space and hence facilitate learning; on the other hand, they introduce bias by assigning positive probability to non-optimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularisation. We study algorithms resulting from two entropy regularisation formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalises the divergence of policies between two consecutive episodes. We analyse the finite-horizon continuous-time linear-quadratic (LQ) RL problem, for which both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation, and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularisation, we prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best-known result from the literature.
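To make the idea of executing a Gaussian relaxed policy at a chosen sampling frequency more concrete, the following is a minimal sketch and not the algorithm analysed in this work. It assumes a scalar LQ system, an Euler-Maruyama discretisation, and a fixed feedback gain; the function name `simulate_episode` and all parameter names and values are hypothetical. The exploration noise is redrawn independently at each of `n_samples` grid points and held constant in between, which is one simple way to realise execution noise that is independent across time.

```python
import math
import random


def simulate_episode(k, sigma, n_samples, n_substeps=20, horizon=1.0,
                     a=0.5, b=1.0, noise=0.2, q=1.0, r=1.0, x0=1.0, seed=0):
    """Simulate one episode of a scalar LQ system under a Gaussian policy.

    Assumed (illustrative) model:
        dynamics: dX = (a X + b u) dt + noise dW
        cost:     integral over [0, horizon] of (q X^2 + r u^2) dt
        action:   u = k X + sigma * xi, with xi ~ N(0, 1)

    The draw xi is resampled independently at n_samples equally spaced
    times and held constant between them; the feedback term k X is
    updated at every Euler step.
    """
    rng = random.Random(seed)
    dt = horizon / (n_samples * n_substeps)
    x, cost = x0, 0.0
    for _ in range(n_samples):
        xi = rng.gauss(0.0, 1.0)  # fresh, independent exploration noise
        for _ in range(n_substeps):
            u = k * x + sigma * xi
            cost += (q * x * x + r * u * u) * dt  # left-endpoint quadrature
            dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
            x += (a * x + b * u) * dt + noise * dw
    return cost


# Same gain, with and without exploration noise (sigma), at 50 samples per episode.
cost_greedy = simulate_episode(k=-1.0, sigma=0.0, n_samples=50)
cost_explore = simulate_episode(k=-1.0, sigma=0.5, n_samples=50)
```

In this sketch, `sigma` plays the role of the exploration strength induced by the entropy regularisation and `n_samples` the sampling frequency; how the two should be scheduled across episodes is exactly what the regret analysis addresses and is not reproduced here.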
