Soft Actor-Critic (SAC) is one of the state-of-the-art off-policy reinforcement learning (RL) algorithms within the maximum entropy RL framework. SAC has been demonstrated to perform very well on a range of continuous control tasks with good stability and robustness. SAC learns a stochastic Gaussian policy that maximizes a trade-off between the total expected reward and the policy entropy. To update the policy, SAC minimizes the KL divergence between the current policy density and the soft value function density, and the reparameterization trick is then used to obtain an approximate gradient of this divergence. In this paper, we propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO), which uses the Cross-Entropy Method (CEM) to optimize the policy network of SAC. The main idea is to use CEM to iteratively sample the distribution closest to the soft value function density and to use the resultant distribution as a target for updating the policy network. To reduce the computational complexity, we also introduce a decoupled policy structure that splits the Gaussian policy into one policy that learns the mean and another that learns the standard deviation, so that only the mean policy is trained by CEM. We show that this decoupled policy structure converges to an optimal policy, and we demonstrate by experiments that SAC-CEPO achieves competitive performance against the original SAC.
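For reference, the policy update mentioned above follows the standard SAC formulation, in which the policy parameters $\phi$ are trained to minimize the KL divergence between the policy and the (normalized) exponentiated soft Q-function with temperature $\alpha$:
$$
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\!\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s_t) \,\Big\|\, \frac{\exp\!\big(\tfrac{1}{\alpha} Q_\theta(s_t, \cdot)\big)}{Z_\theta(s_t)} \right) \right].
$$
To illustrate the cross-entropy component informally, the following is a minimal sketch of a generic CEM loop that, for a given state, searches for a Gaussian whose samples score highly under a soft Q-function and returns its mean as a possible regression target for a mean policy. All names (`cem_mean_target`, `q_fn`) and hyperparameters are hypothetical and only indicate the general idea, not the paper's exact procedure.

```python
import numpy as np

def cem_mean_target(q_fn, state, act_dim, iters=10, pop=64, elite_frac=0.1, init_std=1.0):
    """Illustrative CEM loop (hypothetical sketch): sample candidate actions from a
    Gaussian, keep the elites under the soft Q-value, and refit the Gaussian to them.
    The final mean could serve as a target for training the mean policy."""
    mean = np.zeros(act_dim)
    std = np.full(act_dim, init_std)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = np.random.randn(pop, act_dim) * std + mean   # candidate actions
        scores = np.array([q_fn(state, a) for a in samples])   # soft Q estimates
        elites = samples[np.argsort(scores)[-n_elite:]]        # highest-scoring candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Toy usage with a dummy quadratic "Q-function" peaked at a = 0.3:
target = cem_mean_target(lambda s, a: -np.sum((a - 0.3) ** 2), state=None, act_dim=2)
```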