In this work, we study policy-based methods for solving the reinforcement learning problem, where off-policy sampling and linear function approximation are employed for policy evaluation, and various update rules, including natural policy gradient (NPG), are considered for the policy update. To solve the policy evaluation sub-problem in the presence of the deadly triad, we propose a generic algorithmic framework of multi-step TD-learning with generalized importance sampling ratios, which includes two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace. The generic algorithm is single time-scale, has provable finite-sample guarantees, and overcomes the high-variance issue in off-policy learning. As for the policy update, we provide a universal analysis that uses only the contraction and monotonicity properties of the Bellman operator to establish geometric convergence under various policy update rules. Importantly, by viewing NPG as an approximate way of implementing policy iteration, we establish the geometric convergence of NPG without introducing regularization and without relying on the mirror-descent type of analysis used in the existing literature. Combining the geometric convergence of the policy update with the finite-sample analysis of the policy evaluation, we establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.
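For concreteness, a generic $n$-step off-policy TD update of the kind such a framework unifies can be sketched as follows. This is only an illustrative V-trace/Retrace-style form under assumed linear features $\phi(s,a)$, behavior policy $\mu$, target policy $\pi$, and step size $\alpha_k$; the named algorithms correspond to particular choices of the generalized importance sampling ratios $c$ and $\rho$ (e.g., $\lambda$-averaged or two-sided truncations of $\pi(a\mid s)/\mu(a\mid s)$), and the exact update in the paper may differ in details:
$$
w_{k+1} = w_k + \alpha_k\, \phi(S_k, A_k) \sum_{i=k}^{k+n-1} \gamma^{\,i-k} \Big( \prod_{j=k+1}^{i} c(S_j, A_j) \Big) \rho(S_i, A_i)\, \delta_i,
$$
$$
\delta_i = R(S_i, A_i) + \gamma \sum_{a'} \pi(a' \mid S_{i+1})\, \phi(S_{i+1}, a')^\top w_k - \phi(S_i, A_i)^\top w_k.
$$
Choosing $c \equiv \rho \equiv \pi/\mu$ would give the fully importance-corrected multi-step target, whose products of ratios can have very high variance; generalized (e.g., truncated or averaged) ratios are precisely what keep the variance controlled at the cost of a bounded bias.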