Humans have a natural ability to effortlessly comprehend linguistic commands such as "park next to the yellow sedan" and instinctively know which region of the road the vehicle should navigate to. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands. To this end, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command. RNR differs from Referring Image Segmentation (RIS), which grounds the object referred to by a natural language expression rather than a navigable region. For example, given the command "park next to the yellow sedan," RIS aims to segment the referred sedan, whereas RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset with segmentation masks for the regions described by the linguistic commands. A separate test split with concise manoeuvre-oriented commands is provided to assess the practicality of our dataset. We benchmark the proposed dataset using a novel transformer-based architecture. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.
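To make the task's input/output contract concrete, below is a minimal sketch of what an RNR-style model interface could look like in PyTorch: an image and a tokenized command go in, and a dense mask over the navigable region comes out. This is not the paper's actual architecture; the class name `RNRModel`, the layer choices, and all dimensions are hypothetical, chosen only to illustrate a transformer-based fusion of visual patches and command tokens.

```python
# Hypothetical sketch of an RNR interface: image + linguistic command ->
# binary mask over the navigable region. Names and dimensions are
# illustrative, not the paper's architecture.
import torch
import torch.nn as nn

class RNRModel(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, out_size=(120, 120)):
        super().__init__()
        # Visual encoder: a toy patchifying conv standing in for a backbone.
        self.visual = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=8, stride=8), nn.ReLU(),
        )
        # Text encoder: token embedding + small transformer encoder.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Cross-modal fusion: image patches attend to command tokens.
        self.fuse = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Mask head: per-patch logit, upsampled to the output resolution.
        self.head = nn.Conv2d(d_model, 1, kernel_size=1)
        self.out_size = out_size

    def forward(self, image, tokens):
        feat = self.visual(image)                  # (B, C, H', W')
        b, c, h, w = feat.shape
        patches = feat.flatten(2).transpose(1, 2)  # (B, H'*W', C)
        txt = self.text_enc(self.embed(tokens))    # (B, T, C)
        fused, _ = self.fuse(patches, txt, txt)    # patches query the text
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        logits = self.head(fused)                  # (B, 1, H', W')
        return nn.functional.interpolate(
            logits, size=self.out_size, mode="bilinear", align_corners=False
        )

# Usage with a dummy image and a tokenized command such as
# "park next to the yellow sedan".
model = RNRModel()
image = torch.randn(1, 3, 480, 480)
tokens = torch.randint(0, 10000, (1, 8))
mask = torch.sigmoid(model(image, tokens))  # (1, 1, 120, 120) region probabilities
```

The resulting per-pixel probabilities can be thresholded into the region mask that a downstream path planner would consume when generating trajectories.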