In Reinforcement Learning (RL), the Laplacian Representation (LapRep) is a task-agnostic state representation that encodes the geometry of the environment. A desirable property of LapRep stated in prior works is that the Euclidean distance in the LapRep space roughly reflects the reachability between states, which motivates using this distance for reward shaping. However, we find that LapRep does not necessarily have this property in general: two states with a small distance under LapRep can actually be far away in the environment. Such a mismatch would impede the learning process in reward shaping. To fix this issue, we introduce a Reachability-Aware Laplacian Representation (RA-LapRep), obtained by properly scaling each dimension of LapRep. Despite its simplicity, we demonstrate that RA-LapRep better captures inter-state reachability than LapRep, through both theoretical explanations and experimental results. Additionally, we show that this improvement yields a significant boost in reward shaping performance and also benefits bottleneck state discovery.
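To make the idea concrete, the following is a minimal sketch of how a Laplacian representation and a reachability-aware rescaling of it could be computed and compared on a small grid-world graph. The abstract only states that RA-LapRep "properly scales each dimension"; the specific choice below (weighting each eigenvector dimension by the inverse square root of its eigenvalue), the grid environment, and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from itertools import product

# Illustrative sketch only: the grid world, the dimension d, and the
# inverse-sqrt-eigenvalue scaling are assumptions for demonstration.

def grid_laplacian(n=8):
    """Graph Laplacian of an n x n grid world with 4-neighbour connectivity."""
    states = list(product(range(n), range(n)))
    index = {s: i for i, s in enumerate(states)}
    A = np.zeros((len(states), len(states)))
    for (x, y) in states:
        for dx, dy in [(1, 0), (0, 1)]:
            nb = (x + dx, y + dy)
            if nb in index:
                A[index[(x, y)], index[nb]] = 1
                A[index[nb], index[(x, y)]] = 1
    D = np.diag(A.sum(axis=1))
    return D - A, index

L, index = grid_laplacian()
eigvals, eigvecs = np.linalg.eigh(L)

d = 10                          # representation dimension (assumed)
lam = eigvals[1:d + 1]          # skip the trivial 0 eigenvalue
phi = eigvecs[:, 1:d + 1]       # LapRep: smallest non-trivial eigenvectors

# Assumed reachability-aware scaling: weight dimension k by 1/sqrt(lambda_k),
# so slowly-varying (long-range) eigenvectors dominate the distance.
psi = phi / np.sqrt(lam)

def shaping_distance(rep, s, g):
    """Euclidean distance in representation space, usable as a shaping signal."""
    return np.linalg.norm(rep[index[s]] - rep[index[g]])

s, g = (0, 0), (7, 7)
print("LapRep distance:   ", shaping_distance(phi, s, g))
print("RA-LapRep distance:", shaping_distance(psi, s, g))
```

In this sketch, the scaled distance between a state and a goal could serve as a (negated) shaping reward; whether this particular scaling matches the paper's RA-LapRep is an assumption made here for illustration.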