We revisit the problem of controlling linear systems with quadratic cost under unknown dynamics using model-based reinforcement learning. Traditional methods such as Optimism in the Face of Uncertainty and Thompson Sampling, rooted in multi-armed bandits (MABs), face practical limitations. In contrast, we propose an alternative based on the Confusing Instance (CI) principle, which underpins regret lower bounds in MABs and discrete Markov Decision Processes (MDPs) and is central to the Minimum Empirical Divergence (MED) family of algorithms, known for their asymptotic optimality in various settings. By leveraging the structure of LQR policies together with sensitivity and stability analysis, we develop MED-LQ. This novel control strategy extends the principles of CI and MED beyond small-scale settings. Our benchmarks on a comprehensive control suite demonstrate that MED-LQ achieves competitive performance in various scenarios while highlighting its potential for broader applications in large-scale MDPs.