The performance of deep reinforcement learning methods is prone to degrade when they are applied to environments with non-stationary dynamics. In this paper, we utilize latent context recurrent encoders motivated by recent Meta-RL literature and propose the Latent Context-based Soft Actor Critic (LC-SAC) method to address the aforementioned issue. By minimizing a contrastive prediction loss function, the learned context variables capture information about the environment dynamics and the recent behavior of the agent. Combined with the soft policy iteration paradigm, the LC-SAC method then alternates between soft policy evaluation and soft policy improvement until it converges to the optimal policy. Experimental results show that LC-SAC performs significantly better than the SAC algorithm on the MetaWorld ML1 tasks, whose dynamics change drastically across episodes, and is comparable to SAC on the continuous control benchmark MuJoCo, whose dynamics change slowly or not at all between episodes. In addition, we conduct experiments to determine the impact of different hyperparameter settings on the performance of the LC-SAC algorithm and give reasonable suggestions for setting these hyperparameters.
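To make the idea concrete, the following is a minimal PyTorch sketch of a recurrent latent-context encoder trained with an InfoNCE-style contrastive prediction objective of the kind described above. The module and function names (ContextEncoder, contrastive_prediction_loss), the transition-window shapes, and the specific InfoNCE form are illustrative assumptions, not the paper's exact implementation; in LC-SAC the resulting context variable would additionally be fed to the SAC actor and critic networks.

```python
# Minimal sketch (assumed names and shapes): a GRU context encoder over recent
# (state, action, reward) transitions plus an InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    """Encodes a short window of (state, action, reward) transitions into a latent context z."""
    def __init__(self, obs_dim, act_dim, latent_dim=16, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, obs, act, rew):
        # obs: (B, T, obs_dim), act: (B, T, act_dim), rew: (B, T, 1)
        x = torch.cat([obs, act, rew], dim=-1)
        _, h = self.gru(x)                 # h: (1, B, hidden_dim), final hidden state
        return self.head(h.squeeze(0))     # z: (B, latent_dim)

def contrastive_prediction_loss(z_query, z_key, temperature=0.1):
    """InfoNCE loss: the i-th query should match the i-th key (e.g. two transition
    windows drawn from the same episode); all other keys in the batch are negatives."""
    z_query = F.normalize(z_query, dim=-1)
    z_key = F.normalize(z_key, dim=-1)
    logits = z_query @ z_key.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(z_query.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage sketch: sample two windows from each of B episodes and pull their contexts together.
if __name__ == "__main__":
    B, T, obs_dim, act_dim = 8, 20, 11, 3
    enc = ContextEncoder(obs_dim, act_dim)
    obs1, act1, rew1 = torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim), torch.randn(B, T, 1)
    obs2, act2, rew2 = torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim), torch.randn(B, T, 1)
    loss = contrastive_prediction_loss(enc(obs1, act1, rew1), enc(obs2, act2, rew2))
    loss.backward()  # gradients flow into the encoder; z would then condition the SAC networks
    print(loss.item())
```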