Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this paper, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $\epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $\log \, (1/\epsilon)$.
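For concreteness, the following is a minimal sketch of the quantities the abstract refers to; the notation ($\mathcal{D}$, $\Sigma$, $\alpha$, $r$, $N$, $U_i$) and the particular form of the two-point estimator are generic conventions chosen for illustration rather than definitions taken from the paper itself. For the continuous-time system $\dot{x} = Ax + Bu$ with state feedback $u = -Kx$, the infinite-horizon LQR objective is
\[
J(K) \;=\; \mathbb{E}_{x_0 \sim \mathcal{D}} \int_0^\infty \bigl( x^\top Q x + u^\top R u \bigr) \, \mathrm{d}t ,
\]
which, for a stabilizing gain $K$, can be written as $J(K) = \operatorname{trace}\bigl(P(K)\,\Sigma\bigr)$, where $\Sigma = \mathbb{E}[x_0 x_0^\top]$ and $P(K)$ solves the Lyapunov equation
\[
(A - BK)^\top P \;+\; P\,(A - BK) \;+\; Q \;+\; K^\top R K \;=\; 0 .
\]
The gradient-flow dynamics and their forward Euler discretization (gradient descent with stepsize $\alpha$) are
\[
\dot{K} \;=\; -\nabla J(K), \qquad K^{k+1} \;=\; K^{k} - \alpha \, \nabla J(K^{k}) ,
\]
and the model-free random search replaces $\nabla J(K)$ with a two-point estimate of the generic form
\[
\widehat{\nabla} J(K) \;=\; \frac{1}{2 r N} \sum_{i=1}^{N} \bigl( J(K + r\,U_i) - J(K - r\,U_i) \bigr) \, U_i ,
\]
where the $U_i$ are random perturbation directions, $r > 0$ is a smoothing radius, and each function value $J(\cdot)$ is obtained from a finite-time simulation of the closed-loop system.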