To achieve autonomy in a priori unknown real-world scenarios, agents should be able to: i) act from high-dimensional sensory observations (e.g., images), ii) learn from past experience to adapt and improve, and iii) be capable of long-horizon planning. Classical planning algorithms (e.g., PRM, RRT) are proficient at handling long-horizon planning. Deep-learning-based methods, in turn, can provide the necessary representations to address the other two, by modeling statistical contingencies between observations. In this direction, we introduce a general-purpose planning algorithm called PALMER that combines classical sampling-based planning algorithms with learning-based perceptual representations. To train these perceptual representations, we combine Q-learning with contrastive representation learning to create a latent space in which the distance between the embeddings of two states captures how easily an optimal policy can traverse between them. To plan with these perceptual representations, we re-purpose classical sampling-based planning algorithms to retrieve previously observed trajectory segments from a replay buffer and restitch them into approximately optimal paths that connect any given pair of start and goal states. This creates a tight feedback loop between representation learning, memory, reinforcement learning, and sampling-based planning. The end result is an experiential framework for long-horizon planning that is significantly more robust and sample efficient compared to existing methods.
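The retrieve-and-restitch idea above can be illustrated with a minimal toy sketch. This is not the PALMER implementation: `embed` below is a hypothetical stand-in for the learned encoder (which in the paper is trained with Q-learning and contrastive learning so latent distance reflects traversability), and the planner is a plain Dijkstra search over buffered states connected when their latent distance falls under a threshold.

```python
import numpy as np
from heapq import heappush, heappop

def embed(state):
    # Hypothetical stand-in for the learned encoder. In PALMER this is
    # trained with Q-learning + contrastive representation learning so
    # that latent distance reflects how easily an optimal policy can
    # traverse between two states; here it is just the identity map.
    return np.asarray(state, dtype=float)

def latent_dist(a, b):
    return float(np.linalg.norm(embed(a) - embed(b)))

def plan(buffer_states, start, goal, radius=1.5):
    """Restitch previously observed states into a path: connect states
    whose latent distance is under `radius`, then run Dijkstra from
    start (node 0) to goal (last node)."""
    nodes = [start] + list(buffer_states) + [goal]
    n = len(nodes)
    # Dense pairwise distances are fine for this toy example.
    dist = [[latent_dist(nodes[i], nodes[j]) for j in range(n)]
            for i in range(n)]
    pq, best, prev = [(0.0, 0)], {0: 0.0}, {}
    while pq:
        d, u = heappop(pq)
        if u == n - 1:
            break
        if d > best.get(u, float("inf")):
            continue
        for v in range(n):
            if v != u and dist[u][v] <= radius:
                nd = d + dist[u][v]
                if nd < best.get(v, float("inf")):
                    best[v], prev[v] = nd, u
                    heappush(pq, (nd, v))
    if n - 1 not in prev:
        return None  # goal not reachable through the buffer
    path, cur = [], n - 1
    while True:
        path.append(nodes[cur])
        if cur == 0:
            break
        cur = prev[cur]
    return path[::-1]

# Toy usage: 1-D states previously visited along a corridor.
buffer_states = [[1.0], [2.0], [3.0], [4.0]]
path = plan(buffer_states, start=[0.0], goal=[5.0], radius=1.5)
print(path)  # -> [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
```

The start and goal are connected only through intermediate buffered states, mirroring how the paper restitches trajectory segments from the replay buffer rather than planning in the raw state space.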