We present an analytical policy update rule that is independent of parameterized function approximators. The update rule applies to general stochastic policies and carries a monotonic improvement guarantee. It is derived from a closed-form trust-region solution using the calculus of variations, building on a new theoretical result that tightens existing bounds for policy search with trust-region methods. We also provide an explanation that connects the policy update rule to value-function methods. Based on a recursive form of the update rule, an off-policy algorithm follows naturally, and the monotonic improvement guarantee is preserved. Furthermore, the update rule extends immediately to multi-agent systems when updates are performed by one agent at a time.
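For concreteness, closed-form trust-region solutions of this kind are commonly an exponentiated-advantage reweighting of the current policy. The following is a minimal sketch assuming the standard KL-constrained subproblem; the temperature $\eta$, normalizer $Z(s)$, and advantage $A^{\pi}(s,a)$ are illustrative notation and need not match the exact rule derived in the paper:
\[
\pi'(\cdot\,|\,s) \;=\; \arg\max_{\tilde{\pi}} \;\mathbb{E}_{a \sim \tilde{\pi}(\cdot|s)}\!\left[A^{\pi}(s,a)\right]
\quad \text{s.t.} \quad D_{\mathrm{KL}}\!\left(\tilde{\pi}(\cdot|s)\,\big\|\,\pi(\cdot|s)\right) \le \delta,
\]
whose stationarity condition, obtained via the calculus of variations on the Lagrangian, gives
\[
\pi'(a\,|\,s) \;=\; \frac{\pi(a\,|\,s)\,\exp\!\left(A^{\pi}(s,a)/\eta\right)}{Z(s)},
\qquad
Z(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[\exp\!\left(A^{\pi}(s,a)/\eta\right)\right],
\]
where $\eta$ is the Lagrange multiplier chosen so the KL constraint is satisfied. In this generic form, the update acts directly on the policy distribution rather than on parameters, which illustrates how such a rule can be independent of parameterized function approximators.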