We consider the problem of controlling an unknown stochastic linear system with quadratic costs, known as the adaptive LQ control problem. We re-examine an approach called ``Reward Biased Maximum Likelihood Estimate'' (RBMLE) that was proposed more than forty years ago, and which predates the ``Upper Confidence Bound'' (UCB) method as well as the definition of ``regret'' for bandit problems. It simply adds a term favoring parameters with larger rewards to the criterion for parameter estimation. We show how the RBMLE and UCB methods can be reconciled, and thereby propose an Augmented RBMLE-UCB algorithm that combines the penalty of the RBMLE method with the constraints of the UCB method, uniting the two approaches to optimism in the face of uncertainty. We establish that, theoretically, this method retains the $\Tilde{\mathcal{O}}(\sqrt{T})$ regret guarantee, the best known so far. We further compare the empirical performance of the proposed Augmented RBMLE-UCB and the standard RBMLE (without the augmentation) against UCB, Thompson Sampling, Input Perturbation, Randomized Certainty Equivalence, and StabL on many real-world examples, including flight control of a Boeing 747 and of an Unmanned Aerial Vehicle. We perform extensive simulation studies showing that the Augmented RBMLE consistently outperforms UCB, Thompson Sampling, and StabL by a large margin, while it is marginally better than Input Perturbation and moderately better than Randomized Certainty Equivalence.
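A schematic form of the reward-biased criterion described above may help fix ideas; the symbols here are illustrative and not fixed by the abstract, with $L_t(\theta)$ denoting the likelihood of the data collected up to time $t$, $J^*(\theta)$ the optimal reward attainable if the true parameter were $\theta$, and $\alpha_t$ a bias weight:
\[
\hat{\theta}_t \in \arg\max_{\theta} \; \Big\{ \log L_t(\theta) + \alpha_t \, J^*(\theta) \Big\},
\]
so that, relative to plain maximum likelihood, the estimate is tilted toward parameters whose optimal reward $J^*(\theta)$ is larger, which is the source of the method's optimism.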