Human trajectory forecasting is a key component of autonomous vehicles, socially aware robots and advanced video-surveillance applications. This challenging task typically requires knowledge about past motion, the environment and likely destination areas. In this context, multi-modality is a fundamental aspect and its effective modeling can be beneficial to any architecture. Inferring accurate trajectories is nevertheless challenging due to the inherently uncertain nature of the future. To overcome these difficulties, recent models use different inputs and propose to model human intentions using complex fusion mechanisms. In this respect, we propose a lightweight attention-based recurrent backbone that acts solely on past observed positions. Although this backbone already provides promising results, we demonstrate that its prediction accuracy can be improved considerably when combined with a scene-aware goal-estimation module. To this end, we employ a common goal module, based on a U-Net architecture, which additionally extracts semantic information to predict scene-compliant destinations. We conduct extensive experiments on publicly available datasets (SDD, inD and ETH/UCY) and show that our approach performs on par with state-of-the-art techniques while reducing model complexity.