We introduce the first direct policy search algorithm which provably converges to the globally optimal $\textit{dynamic}$ filter for the classical problem of predicting the outputs of a linear dynamical system, given noisy, partial observations. Despite the ubiquity of partial observability in practice, theoretical guarantees for direct policy search algorithms, one of the backbones of modern reinforcement learning, have proven difficult to achieve. This is primarily due to the degeneracies which arise when optimizing over filters that maintain internal state. In this paper, we provide a new perspective on this challenging problem based on the notion of $\textit{informativity}$, which intuitively requires that all components of a filter's internal state be representative of the true state of the underlying dynamical system. We show that informativity overcomes these degeneracies. Specifically, we propose a $\textit{regularizer}$ which explicitly enforces informativity, and establish that gradient descent on this regularized objective, combined with a ``reconditioning step'', converges to the globally optimal cost at a rate of $\mathcal{O}(1/T)$. Our analysis relies on several new results which may be of independent interest, including a new framework for analyzing non-convex gradient descent via convex reformulation, and novel bounds on the solution to linear Lyapunov equations in terms of (our quantitative measure of) informativity.
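To make the algorithmic template concrete, the following is a minimal sketch of gradient descent on a regularized objective interleaved with a reconditioning step. It is purely illustrative: the cost \texttt{J}, the penalty \texttt{informativity\_regularizer}, and the map \texttt{recondition} are hypothetical stand-ins on a toy quadratic problem, not the paper's actual filtering objective or its informativity regularizer.

\begin{verbatim}
import numpy as np

def J(theta):
    # Placeholder prediction cost: a simple quadratic (hypothetical).
    return 0.5 * float(theta @ theta)

def grad_J(theta):
    # Gradient of the quadratic placeholder cost.
    return theta

def informativity_regularizer_grad(theta, lam=0.1):
    # Stand-in gradient of a penalty that would keep the filter's
    # internal state "informative"; here just a quadratic term.
    return lam * theta

def recondition(theta):
    # Stand-in for the reconditioning step: rescale the iterate so
    # it stays in a well-conditioned region.
    return theta / max(np.linalg.norm(theta), 1.0)

theta = np.random.default_rng(0).normal(size=4)
eta = 0.1  # step size
for t in range(100):
    g = grad_J(theta) + informativity_regularizer_grad(theta)
    theta = recondition(theta - eta * g)

print("final cost:", J(theta))
\end{verbatim}

In this template, the regularizer shapes the descent direction while the reconditioning step keeps the iterates well-behaved; the paper's guarantee concerns the analogous procedure on the (non-convex) filter-synthesis objective.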