The ability to converse with humans and follow natural language commands is crucial for intelligent unmanned aerial vehicles (a.k.a. drones). It can relieve people's burden of holding a controller all the time, allow multitasking, and make drone control more accessible for people with disabilities or whose hands are occupied. To this end, we introduce Aerial Vision-and-Dialog Navigation (AVDN), the task of navigating a drone via natural language conversation. We build a drone simulator with a continuous photorealistic environment and collect a new AVDN dataset of over 3k recorded navigation trajectories with asynchronous human-human dialogs between commanders and followers. The commander provides an initial navigation instruction and further guidance on request, while the follower navigates the drone in the simulator and asks questions when needed. During data collection, the followers' attention on the drone's visual observation is also recorded. Based on the AVDN dataset, we study the tasks of aerial navigation from (full) dialog history and propose an effective Human Attention Aided (HAA) baseline model, which learns to predict both navigation waypoints and human attention. The dataset and code will be released.
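Since the abstract only names the HAA model at a high level, the following is a minimal sketch of how a model could jointly predict navigation waypoints and human attention from a visual observation and the dialog history. This is an illustrative assumption, not the paper's architecture: every module choice, the dimensions (`d_model`, `grid`), and the loss weight `alpha` are hypothetical.

```python
# A minimal sketch of a Human Attention Aided (HAA) style model, NOT the
# authors' implementation: all module names, dimensions, and the loss
# weighting below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HAASketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, grid=7):
        super().__init__()
        self.grid = grid
        # Visual encoder: a small CNN over the drone's aerial view.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(grid),  # -> (B, 64, grid, grid)
        )
        self.vis_proj = nn.Linear(64, d_model)
        # Dialog encoder: embed and summarize the (full) dialog history.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.dialog = nn.GRU(d_model, d_model, batch_first=True)
        # Cross-attention fuses the dialog state with visual patch features.
        self.fuse = nn.MultiheadAttention(d_model, 4, batch_first=True)
        # Two heads: a next waypoint (x, y offset) and a human-attention map.
        self.waypoint_head = nn.Linear(d_model, 2)
        self.attn_head = nn.Linear(d_model, 1)

    def forward(self, image, tokens):
        b = image.size(0)
        v = self.visual(image)                  # (B, 64, g, g)
        v = v.flatten(2).transpose(1, 2)        # (B, g*g, 64)
        v = self.vis_proj(v)                    # (B, g*g, d)
        _, h = self.dialog(self.embed(tokens))  # h: (1, B, d)
        q = h.transpose(0, 1)                   # (B, 1, d)
        fused, _ = self.fuse(q, v, v)           # (B, 1, d)
        waypoint = self.waypoint_head(fused.squeeze(1))  # (B, 2)
        attn_logits = self.attn_head(v).squeeze(-1)      # (B, g*g)
        attn_map = attn_logits.view(b, self.grid, self.grid)
        return waypoint, attn_map

def haa_loss(waypoint, attn_map, gt_waypoint, gt_attn, alpha=0.5):
    """Joint objective: waypoint regression plus attention-map prediction,
    mirroring the idea of supervising with recorded human attention.
    The weight alpha is an assumed hyperparameter."""
    nav = F.mse_loss(waypoint, gt_waypoint)
    att = F.binary_cross_entropy_with_logits(attn_map, gt_attn)
    return nav + alpha * att
```

The design point the sketch illustrates is the multi-task head: the recorded human attention serves as an auxiliary supervision signal alongside waypoint prediction, so the gradient from `attn_head` encourages the shared visual features to focus where human followers actually looked.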