In this work, we present the first finite-time analysis of the Q-learning algorithm under time-varying learning policies (i.e., on-policy sampling) with minimal assumptions -- specifically, assuming only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $O(1/\epsilon^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty] \le \epsilon$, matching that of off-policy Q-learning but with a worse dependence on exploration-related parameters. We also derive an explicit rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$. These results reveal that on-policy Q-learning exhibits weaker exploration than its off-policy counterpart but enjoys an exploitation advantage, as its policy converges to an optimal one rather than remaining fixed. Numerical simulations corroborate our theory. Technically, the combination of time-varying learning policies (which induce rapidly time-inhomogeneous Markovian noise) and the minimal assumption on exploration presents significant analytical challenges. To address these challenges, we employ a refined approach that leverages the Poisson equation to decompose the Markovian noise corresponding to the lazy transition matrix into a martingale-difference term and residual terms. To control the residual terms under time inhomogeneity, we perform a sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These tools may further facilitate the analysis of general reinforcement learning algorithms with rapidly time-varying learning policies -- such as single-timescale actor--critic methods and learning-in-games algorithms -- and are of independent interest.
翻译:暂无翻译