In a busy city street, a pedestrian surrounded by distractions can pick out a single sign if it is relevant to their route. Artificial agents in outdoor Vision-and-Language Navigation (VLN) face a similar challenge: detecting the supervisory signal on environment features and location in the inputs. To boost the prominence of relevant features in transformer-based architectures without costly preprocessing and pretraining, we take inspiration from priority maps, a mechanism described in neuropsychological studies. We implement a novel priority map module and pretrain it on auxiliary tasks using low-sample datasets with high-level representations of routes and environment-related references to urban features. A hierarchical process of trajectory planning, followed by parameterised visual boost filtering on visual inputs and prediction of corresponding textual spans, addresses the core challenges of cross-modal alignment and feature-level localisation. The priority map module is integrated into a feature-location framework that doubles the task completion rates of standalone transformers and attains state-of-the-art performance on the Touchdown benchmark for VLN. Code and data are referenced in Appendix C.