Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization, and optimization techniques (e.g., gradient clipping and averaging) in achieving provably good performance in terms of sample complexity, iteration complexity, and overparametrization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies, and (ii) regularization leads to sharp sample-complexity and network-width bounds for the regularized MDP, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of the uniform approximation power of the actor neural network: because of distributional shift, achieving global optimality in policy optimization requires approximation guarantees that hold uniformly over states rather than only under a fixed state distribution.
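For concreteness, the entropy-regularized objective referenced above takes the standard form sketched below; the notation here ($\lambda > 0$ for the regularization weight, $\gamma \in (0,1)$ for the discount factor, $\mathcal{H}$ for the policy entropy, $\mu$ for the initial-state distribution) is an assumption for illustration and may differ from the paper's own symbols:
$$ V_\lambda^\pi(\mu) = \mathbb{E}_{s_0 \sim \mu}\!\left[\sum_{t=0}^{\infty} \gamma^t \Bigl( r(s_t, a_t) + \lambda\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \Bigr)\right], \qquad \mathcal{H}\bigl(\pi(\cdot \mid s)\bigr) = -\sum_a \pi(a \mid s)\log \pi(a \mid s). $$
The entropy bonus keeps the optimal regularized policy bounded away from determinism, which is the exploration mechanism behind the stability claim in (i); the price is a bias of order $\lambda$ relative to the unregularized optimum, the bias-variance tradeoff noted in (ii).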