Vision-based navigation requires processing complex information to make task-orientated decisions. Applications include autonomous robots, self-driving cars, and assistive vision for humans. A key element in this process is the extraction and selection of relevant features in pixel space upon which to base action choices, a task for which Machine Learning techniques are well suited. However, Deep Reinforcement Learning agents trained in simulation often exhibit unsatisfactory results when deployed in the real world due to perceptual differences known as the $\textit{reality gap}$. One approach that has yet to be explored for bridging this gap is self-attention. In this paper we (1) perform a systematic exploration of the hyperparameter space for self-attention-based navigation of 3D environments and qualitatively appraise the behaviour observed from different hyperparameter sets, including their ability to generalise; (2) present strategies to improve the agents' generalisation abilities and navigation behaviour; and (3) show how models trained in simulation are capable of processing real-world images meaningfully in real time. To our knowledge, this is the first demonstration of a self-attention-based agent successfully trained to navigate a 3D action space, using fewer than 4000 parameters.
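To make the parameter budget concrete, below is a minimal sketch of patch-based self-attention for visual control in the spirit described by the abstract. All names, patch sizes, and dimensions here are illustrative assumptions, not the paper's actual configuration: flattened image patches are projected into tiny query and key spaces, the resulting attention matrix votes for the most task-relevant patches, and only the learned projections count as parameters.

```python
import numpy as np

# Illustrative sketch (assumed patch size, dimensions, and top-k;
# not the paper's configuration). Queries and keys are linear
# projections of flattened patches; column sums of the attention
# matrix rank patches by importance for a downstream controller.

PATCH = 7                   # assumed patch side length (pixels)
D_IN = PATCH * PATCH * 3    # flattened RGB patch
D_QK = 4                    # tiny projection dimension
TOP_K = 10                  # patches passed on to the controller

rng = np.random.default_rng(0)
W_q = rng.normal(0, 0.1, (D_IN, D_QK))  # query projection
W_k = rng.normal(0, 0.1, (D_IN, D_QK))  # key projection
# Learned parameters: 2 * D_IN * D_QK = 2 * 147 * 4 = 1176 weights,
# comfortably under the ~4000-parameter budget the abstract cites.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_patches(frame):
    """frame: (H, W, 3) image -> indices of the TOP_K attended patches."""
    H, W, _ = frame.shape
    # Extract non-overlapping patches and flatten each one.
    patches = np.stack([
        frame[i:i + PATCH, j:j + PATCH].reshape(-1)
        for i in range(0, H - PATCH + 1, PATCH)
        for j in range(0, W - PATCH + 1, PATCH)
    ])                                        # (N, D_IN)
    q, k = patches @ W_q, patches @ W_k       # (N, D_QK) each
    att = softmax(q @ k.T / np.sqrt(D_QK))    # (N, N) attention matrix
    votes = att.sum(axis=0)                   # per-patch importance
    return np.argsort(votes)[-TOP_K:]         # indices of top-K patches

obs = rng.uniform(0, 1, (84, 84, 3))          # dummy observation
print(select_patches(obs))
```

Because only the patch indices (e.g., their centres) would feed a small controller, the perceptual front end stays tiny and, as the abstract argues, can run in real time on real-world images.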