We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often preferable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret with the optimal dependence on the time horizon T. We present the first model-free algorithm that attains comparable regret guarantees. Our method relies on an efficient policy gradient scheme and a novel, tighter analysis of the cost of exploration in policy space in this setting.
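To make the policy-gradient idea concrete, below is a minimal zeroth-order (gradient-free) policy-gradient sketch for LQR. Everything in it is an illustrative assumption rather than the paper's algorithm or analysis: the system matrices A and B, the costs Q and R, the initial gain, the smoothing radius, and the step sizes are all made up for the example. It only shows the general mechanism of estimating the gradient of the average cost of a linear policy u = -Kx from perturbed rollouts.

```python
# A minimal zeroth-order policy-gradient sketch for LQR.
# All constants (A, B, Q, R, gains, step sizes) are illustrative assumptions,
# not values or choices from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Assumed stabilizable system and quadratic costs.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

def rollout_cost(K, horizon=200, noise=0.01):
    """Average quadratic cost of the linear policy u = -K x over one noisy rollout."""
    x = np.zeros((2, 1))
    total = 0.0
    for _ in range(horizon):
        u = -K @ x
        total += float(x.T @ Q @ x + u.T @ R @ u)
        x = A @ x + B @ u + noise * rng.standard_normal((2, 1))
    return total / horizon

def zeroth_order_grad(K, radius=0.05, samples=20):
    """Two-point sphere-smoothing gradient estimate of the rollout cost at K."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # uniform direction on the unit sphere
        delta = rollout_cost(K + radius * U) - rollout_cost(K - radius * U)
        grad += (d * delta / (2 * radius)) * U
    return grad / samples

K = np.array([[0.5, 1.0]])  # an assumed initial stabilizing gain
for step in range(50):
    K -= 0.05 * zeroth_order_grad(K)

print("learned gain:", K, "avg cost:", rollout_cost(K))
```

The two-point estimator is used here because it typically has lower variance than the one-point variant, at the price of two rollouts per sampled perturbation; the perturbation radius also reflects the cost of exploration in policy space that the abstract refers to, since every gradient estimate is paid for by running deliberately perturbed (hence suboptimal) policies.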