This work uses the entropy-regularised relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Here, the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. On the one hand, noisy policies explore the space and hence facilitate learning; on the other hand, they introduce bias by assigning positive probability to non-optimal actions. This exploration-exploitation trade-off is governed by the strength of the entropy regularisation. We study algorithms arising from two entropy regularisation formulations: the exploratory control approach, in which entropy is added to the cost objective, and the proximal policy update approach, in which entropy penalises the divergence between the policies of two consecutive episodes. We analyse the finite-horizon continuous-time linear-quadratic (LQ) RL problem, for which both formulations yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation, and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularisation, we prove that, for both learning algorithms, the regret over $N$ episodes is of order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor), matching the best known result in the literature.
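For concreteness, a schematic version of the exploratory-control (entropy-regularised) objective in the LQ setting is sketched below; the notation (state $X_t$, cost matrices $Q$, $R$, $G$, temperature $\tau$) is illustrative and may differ from the formulation used in the body of the paper:
\[
J(\pi) \;=\; \mathbb{E}\!\left[ \int_0^T \!\!\int_{\mathbb{R}^m} \tfrac12\big( X_t^\top Q X_t + a^\top R a \big)\, \pi_t(\mathrm{d}a)\,\mathrm{d}t \;+\; \tfrac12 X_T^\top G X_T \;-\; \tau \int_0^T \mathcal{H}(\pi_t)\,\mathrm{d}t \right],
\]
where $\mathcal{H}(\pi_t) = -\int \pi_t(a)\log \pi_t(a)\,\mathrm{d}a$ denotes differential entropy. In this schematic, minimising over relaxed (measure-valued) policies yields a Gaussian policy whose mean is the classical LQ feedback and whose covariance scales with $\tau$, so the temperature $\tau$ directly controls the exploration-exploitation trade-off described above.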