The goal of this technical note is to introduce a new finite-time analysis of tabular temporal-difference (TD) learning based on discrete-time stochastic linear system models. TD-learning is a fundamental reinforcement learning (RL) algorithm for evaluating a given policy by estimating the corresponding value function of a Markov decision process. While there has been a series of successful works on the theoretical analysis of TD-learning, it was not until recently that researchers obtained guarantees on its statistical efficiency by developing finite-time error bounds. In this note, we propose a control-theoretic finite-time analysis of tabular TD-learning that directly exploits discrete-time linear system models and standard notions from the control community. The proposed work provides new, simple templates and additional insights for the analysis of TD-learning and other RL algorithms.
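To make the object of analysis concrete, the following is a minimal sketch of the tabular TD(0) policy-evaluation update the note studies. The toy 3-state chain, the function names, and all parameter values are illustrative assumptions, not taken from the note itself.

```python
# Minimal sketch of tabular TD(0) policy evaluation on a toy Markov chain.
# The chain and all hyperparameters here are illustrative, not from the note.
import random


def td0_policy_evaluation(n_states, step, terminal, alpha=0.1, gamma=0.9,
                          episodes=2000, seed=0):
    """Estimate the value function V of a fixed policy.

    `step(s, rng)` returns (next_state, reward) when following the policy
    from state s; `terminal(s)` says whether s ends the episode.
    """
    rng = random.Random(seed)
    V = [0.0] * n_states
    for _ in range(episodes):
        s = rng.randrange(n_states)
        while not terminal(s):
            s_next, r = step(s, rng)
            # TD(0) update: move V[s] toward the bootstrapped target
            # r + gamma * V[s_next].
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V


# Toy deterministic 3-state chain: 0 -> 1 -> 2 (terminal),
# with reward 1 received on entering the terminal state 2.
def step(s, rng):
    s_next = s + 1
    reward = 1.0 if s_next == 2 else 0.0
    return s_next, reward


V = td0_policy_evaluation(3, step, terminal=lambda s: s == 2)
# V[1] approaches 1.0 and V[0] approaches gamma * V[1] = 0.9.
```

Viewed as a stochastic recursion, this update is exactly the kind of discrete-time stochastic linear system the note's control-theoretic analysis works with: the iterate V evolves linearly in itself plus a noise term.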