Q-learning has long been one of the most popular reinforcement learning algorithms, and its theoretical analysis has been an active research topic for decades. While research on the asymptotic convergence of Q-learning has a long tradition, non-asymptotic convergence has only recently come under active study. The main goal of this paper is to develop a new finite-time analysis of asynchronous Q-learning under Markovian observation models from a control system viewpoint. In particular, we introduce a discrete-time time-varying switching system model of Q-learning with diminishing step-sizes for our analysis, which significantly improves on recent developments in switching system analysis with constant step-sizes and leads to an \(\mathcal{O}\left( \sqrt{\frac{\log k}{k}} \right)\) convergence rate that is comparable to or better than most state-of-the-art results in the literature. Meanwhile, a technique based on a similarity transformation is newly applied to circumvent the difficulty posed by diminishing step-sizes in the analysis. The proposed analysis brings additional insights, covers different scenarios, and provides new simplified templates for analysis to deepen our understanding of Q-learning through its unique connection to discrete-time switching systems.
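For concreteness, the recursion under study is the standard asynchronous Q-learning update, in which only the currently visited state-action pair is updated along a single Markovian trajectory \(\{(s_k, a_k, r_k, s_{k+1})\}\); the diminishing step-size shown below (e.g., \(\alpha_k = 1/(k+1)\)) is a representative schedule given only for illustration, not necessarily the specific choice assumed in the analysis:
\[
Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha_k \left( r_k + \gamma \max_{a \in \mathcal{A}} Q_k(s_{k+1}, a) - Q_k(s_k, a_k) \right),
\qquad
Q_{k+1}(s, a) = Q_k(s, a) \quad \text{for } (s, a) \neq (s_k, a_k),
\]
where \(\gamma \in (0,1)\) is the discount factor. The switching system viewpoint recasts this coordinate-wise update as a discrete-time system whose mode is determined by the greedy action selection in the max operator.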