We consider the problem of controlling a stochastic linear system with quadratic costs when its system parameters are not known to the agent -- the adaptive LQG control problem. We re-examine an approach called "Reward-Biased Maximum Likelihood Estimate" (RBMLE), proposed more than forty years ago, which predates the "Upper Confidence Bound" (UCB) method as well as the definition of "regret". It simply added a term favoring parameters with larger rewards to the estimation criterion. We propose an augmented approach that combines the penalty of the RBMLE method with the constraint of the UCB method, uniting the two approaches to optimization in the face of uncertainty. We first establish that this method theoretically retains $\mathcal{O}(\sqrt{T})$ regret, the best bound known so far. We then show through a comprehensive simulation study that this augmented RBMLE method considerably outperforms the UCB and Thompson sampling approaches, with a regret that is typically less than 50\% of the better of their regrets. The simulation study includes all examples from earlier papers as well as a large collection of randomly generated systems.
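As a rough illustration of the idea (the notation below is ours and schematic, not necessarily the paper's exact formulation): at each time $t$, RBMLE replaces the maximum likelihood estimate by a reward-biased one, and the augmented method additionally restricts the search to a high-confidence set of the kind used by UCB methods,
$$
\hat{\theta}_t \in \arg\max_{\theta \in \mathcal{C}_t} \Big\{ \log L_t(\theta) + \alpha(t)\, J^*(\theta) \Big\},
$$
where $L_t(\theta)$ is the likelihood of the data observed up to time $t$ under parameter $\theta$, $J^*(\theta)$ is the optimal reward (negative average cost) attainable if $\theta$ were the true parameter, $\alpha(t)$ is a reward-bias weight growing sublinearly in $t$, and $\mathcal{C}_t$ is a confidence set around the unbiased estimate. Plain RBMLE corresponds to taking $\mathcal{C}_t$ to be the entire parameter space, while UCB-type methods drop the bias term and optimize over $\mathcal{C}_t$ alone.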