The performance of deep reinforcement learning methods is prone to degrade when they are applied to environments with non-stationary dynamics. In this paper, we utilize latent-context recurrent encoders, motivated by recent Meta-RL work, and propose the Latent Context-based Soft Actor-Critic (LC-SAC) method to address this issue. By minimizing a contrastive prediction loss, the learned context variables capture information about the environment dynamics and the agent's recent behavior. Combined with the soft policy iteration paradigm, the LC-SAC method then alternates between soft policy evaluation and soft policy improvement until it converges to the optimal policy. Experimental results show that LC-SAC significantly outperforms the SAC algorithm on the MetaWorld ML1 tasks, whose dynamics change drastically across episodes, and is comparable to SAC on the MuJoCo continuous control benchmark tasks, whose dynamics change slowly or not at all between episodes. In addition, we conduct experiments to determine the impact of different hyperparameter settings on the performance of the LC-SAC algorithm and give reasonable suggestions for hyperparameter selection.
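To make the abstract's description of the context module concrete, the following is a minimal PyTorch sketch of one plausible instantiation: a recurrent encoder that maps recent transitions to a latent context variable, trained with an InfoNCE-style contrastive prediction loss. The class and function names (`LatentContextEncoder`, `contrastive_prediction_loss`), the GRU architecture, and the exact form of the loss are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentContextEncoder(nn.Module):
    """Recurrent encoder mapping a window of recent (s, a, r) transitions to a latent context z.

    NOTE: architecture is an assumption for illustration; the paper's encoder may differ.
    """

    def __init__(self, obs_dim, act_dim, context_dim, hidden_dim=128):
        super().__init__()
        # Each timestep's input is the concatenation of observation, action, and scalar reward.
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, context_dim)

    def forward(self, transitions):
        # transitions: (batch, seq_len, obs_dim + act_dim + 1)
        _, h = self.gru(transitions)          # h: (num_layers, batch, hidden_dim)
        return self.head(h[-1])               # z: (batch, context_dim)


def contrastive_prediction_loss(z_query, z_positive, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed form of the paper's contrastive prediction loss).

    Each query context is matched to the positive context from the same trajectory segment;
    the other rows in the batch act as negatives.
    """
    z_query = F.normalize(z_query, dim=-1)
    z_positive = F.normalize(z_positive, dim=-1)
    logits = z_query @ z_positive.t() / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(z_query.size(0), device=z_query.device)
    return F.cross_entropy(logits, labels)
```

In this sketch, the context `z` computed from the current trajectory segment would be concatenated with the observation before being passed to SAC's actor and critic networks, which is how the abstract's "combined with the soft policy iteration paradigm" step could be realized.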