The goal of this paper is to develop a control-theoretic analysis of linear stochastic iterative algorithms and temporal-difference (TD) learning. TD-learning is a linear stochastic iterative algorithm for estimating the value function of a given policy in a Markov decision process, and it is one of the most popular and fundamental reinforcement learning algorithms. While there has been a series of successful theoretical analyses of TD-learning, it was not until recently that researchers obtained guarantees on its statistical efficiency. In this paper, we propose a control-theoretic finite-time analysis of TD-learning that exploits standard notions from the linear systems control community. The proposed work thus offers additional insights into TD-learning and reinforcement learning through simple concepts and analysis tools from control theory.
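To make the object of study concrete, the following is a minimal sketch (not taken from the paper) of tabular TD(0) on the standard five-state random-walk example; the update `v[s] += alpha * (r + gamma * v[s_next] - v[s])` is exactly the kind of linear stochastic iteration the abstract refers to. The environment, step size, and episode count are illustrative choices, not values from the paper.

```python
import random

def td0_random_walk(episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) value estimation on a five-state random walk.

    States 1..5 are non-terminal; 0 and 6 are absorbing terminals.
    Reward is +1 on reaching state 6 and 0 otherwise, so the true
    values of states 1..5 are 1/6, 2/6, ..., 5/6 when gamma = 1.
    """
    rng = random.Random(seed)
    v = [0.0] * 7  # value estimates; terminal states stay at 0
    for _ in range(episodes):
        s = 3  # every episode starts in the middle state
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # TD(0): a linear stochastic iterative update toward the
            # bootstrapped target r + gamma * v[s_next]
            v[s] += alpha * (r + gamma * v[s_next] - v[s])
            s = s_next
    return v[1:6]

estimates = td0_random_walk()
true_values = [i / 6 for i in range(1, 6)]
```

After enough episodes the estimates approach the true values, up to the steady-state noise induced by the constant step size; the finite-time behavior of exactly such iterations is what a control-theoretic analysis characterizes.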