In this paper, sample-aware policy entropy regularization is proposed to enhance conventional policy entropy regularization for better exploration. Exploiting the sample distribution obtainable from the replay buffer, the proposed sample-aware entropy regularization maximizes the entropy of the weighted sum of the policy action distribution and the sample action distribution from the replay buffer for sample-efficient exploration. A practical algorithm, named diversity actor-critic (DAC), is developed by applying policy iteration to the objective function with the proposed sample-aware entropy regularization. Numerical results show that DAC significantly outperforms recent existing reinforcement learning algorithms.
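To make the description above concrete, the following is a minimal sketch of the objective it implies, written in our own notation; the mixture weight $w \in [0,1]$, the sample action distribution $q(\cdot \mid s_t)$ estimated from the replay buffer, and the entropy coefficient $\alpha > 0$ are assumed symbols, not quantities defined above:

$J(\pi) = \mathbb{E}\left[ \sum_{t} \gamma^{t} \left( r(s_t, a_t) + \alpha \, \mathcal{H}\big( w\, \pi(\cdot \mid s_t) + (1 - w)\, q(\cdot \mid s_t) \big) \right) \right]$

Intuitively, the mixture entropy term is large when the policy places probability mass on actions that are under-represented among the samples already stored in the replay buffer, which is one way to read the "sample-aware" exploration described above.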