Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used.
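To make the iteration concrete, the following is a minimal sketch (not taken from the paper) of exact, tabular RWR on a toy MDP. It assumes the standard RWR setting: finite states and actions, non-negative rewards, a discount factor below one, and no function approximation, in which case maximizing the return-weighted log-likelihood per state reduces to the closed-form reweighting pi_{n+1}(a|s) proportional to pi_n(a|s) * Q^{pi_n}(s,a). All names (`q_values`, the toy transition matrix `P`, reward matrix `R`) and numerical choices are illustrative assumptions, not part of the original text.

```python
# Minimal sketch of exact (tabular, no function approximation) RWR on a toy MDP.
# Assumed setting: finite MDP, non-negative rewards, gamma < 1; the
# "fit a new policy" step reduces to per-state reweighting of the old policy
# by its Q-values, followed by normalization.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] transition probs
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # non-negative rewards

def q_values(pi, iters=500):
    """Exact policy evaluation: Q^pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V^pi(s')."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)        # V^pi(s) = sum_a pi(a|s) Q^pi(s,a)
        Q = R + gamma * P @ V
    return Q

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform policy
for _ in range(100):
    Q = q_values(pi)
    pi = pi * Q                                   # return-weighted reweighting of the policy
    pi /= pi.sum(axis=1, keepdims=True)           # renormalize per state

print("greedy actions after RWR iterations:", pi.argmax(axis=1))
```

In this exact setting each iteration concentrates probability mass on higher-return actions, which is the monotonic-improvement behavior the abstract refers to; the paper's contribution is proving that this process converges to a global optimum.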