We develop a probabilistic framework for analysing model-based reinforcement learning in the episodic setting. We then apply it to study finite time-horizon stochastic control problems with linear dynamics but unknown coefficients, and a convex, but possibly irregular, objective function. Using probabilistic representations, we study the regularity of the associated cost functions and establish precise estimates for the performance gap between applying optimal feedback controls derived from estimated and from true model parameters. We identify conditions under which this performance gap is quadratic, improving on the linear performance gap in recent work [X. Guo, A. Hu, and Y. Zhang, arXiv preprint arXiv:2104.09311, 2021]; this matches the results obtained for stochastic linear-quadratic problems. Next, we propose a phase-based learning algorithm, for which we show how to optimise the exploration-exploitation trade-off and achieve sublinear regret in high probability and in expectation. When the assumptions required for the quadratic performance gap hold, the algorithm achieves a high-probability regret of order $\mathcal{O}(\sqrt{N} \ln N)$ in the general case, and an expected regret of order $\mathcal{O}((\ln N)^2)$ in the self-exploration case, over $N$ episodes, matching the best possible results from the literature. The analysis requires novel concentration inequalities for correlated continuous-time observations, which we derive.
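For orientation, the regret bounds above measure the cumulative performance loss accumulated over episodes; a standard illustrative formulation (our notation here, which need not coincide with the paper's exact definition) is
\[
  R(N) \;=\; \sum_{n=1}^{N} \big( J(\phi_n) - J(\phi^\star) \big),
\]
where $J$ denotes the cost under the true model, $\phi_n$ the feedback control executed in episode $n$, and $\phi^\star$ the optimal feedback control for the true parameters; sublinear regret then means $R(N)/N \to 0$ as $N \to \infty$.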