Deep reinforcement learning (DRL) has recently succeeded in solving a variety of problems, typically with a unimodal policy representation. However, for tasks with non-unique optima, acquiring distinguishable skills can be essential for further improving learning efficiency and performance, which naturally leads to a multimodal policy represented as a mixture-of-experts (MOE). To the best of our knowledge, existing general-purpose DRL algorithms do not deploy MOE policies as function approximators, owing to the difficulty of differentiating through them during policy learning. In this work, we propose a probabilistic mixture-of-experts (PMOE), implemented with a Gaussian mixture model (GMM), for multimodal policies, together with a novel gradient estimator addressing the non-differentiability problem. PMOE can be applied to generic off-policy and on-policy DRL algorithms that use stochastic policies, e.g., Soft Actor-Critic (SAC) and Proximal Policy Optimisation (PPO). Experimental results on six MuJoCo tasks demonstrate the advantage of our method over unimodal policies, two other MOE methods, and an option-framework-based method, built on both types of DRL algorithms. We also compare our gradient estimator with alternative estimators for the GMM, such as the reparameterisation trick (Gumbel-Softmax) and the score-ratio trick. Finally, we empirically demonstrate the distinguishable primitives learned with PMOE and show the benefits of our method in terms of exploration.
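To make the policy representation concrete, the following is a minimal sketch (not the paper's code) of a GMM policy head in PyTorch: a shared backbone produces, for each of K components, mixture logits, means, and log standard deviations, and an action is drawn by first sampling a component and then sampling from its Gaussian. The class name GMMPolicy and the hyperparameters (n_components, hidden sizes, log-std clamp) are illustrative assumptions.

import torch
import torch.nn as nn

class GMMPolicy(nn.Module):
    """Illustrative GMM policy head: K Gaussian components over the action space."""

    def __init__(self, obs_dim, act_dim, n_components=4, hidden=256):
        super().__init__()
        self.n_components = n_components
        self.act_dim = act_dim
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One set of parameters per mixture component (expert).
        self.logits_head = nn.Linear(hidden, n_components)
        self.mean_head = nn.Linear(hidden, n_components * act_dim)
        self.log_std_head = nn.Linear(hidden, n_components * act_dim)

    def forward(self, obs):
        h = self.backbone(obs)
        logits = self.logits_head(h)                                   # (B, K)
        means = self.mean_head(h).view(-1, self.n_components, self.act_dim)
        log_stds = self.log_std_head(h).view(-1, self.n_components, self.act_dim)
        return logits, means, log_stds.clamp(-5.0, 2.0)

    def sample(self, obs):
        logits, means, log_stds = self(obs)
        # Draw a component index from the categorical mixture weights,
        # then sample the action from that component's Gaussian.
        k = torch.distributions.Categorical(logits=logits).sample()    # (B,)
        idx = k.view(-1, 1, 1).expand(-1, 1, self.act_dim)
        mean = means.gather(1, idx).squeeze(1)
        std = log_stds.gather(1, idx).squeeze(1).exp()
        action = torch.tanh(mean + std * torch.randn_like(std))        # squash to [-1, 1]
        return action

Note that the categorical draw in sample() is exactly the step that breaks differentiability with respect to the mixture weights, which is why a dedicated gradient estimator (or an alternative such as Gumbel-Softmax or the score-ratio trick) is needed to train such a policy end-to-end.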