Human intention prediction is a growing area of research in which a vision-based system must anticipate an activity in a video. To this end, the model builds a representation of the past and then produces hypotheses about upcoming scenarios. In this work, we focus on early prediction of pedestrian intention: from a current observation of an urban scene, the model predicts the future activity of pedestrians approaching the street. Our method is based on a multi-modal transformer that encodes past observations and produces multiple predictions at different anticipation times. Moreover, we propose to learn the attention masks of our transformer-based model (Temporal Adaptive Mask Transformer) in order to weight present and past temporal dependencies differently. We evaluate our method on several public benchmarks for early intention prediction, improving prediction performance at different anticipation times compared with previous work.
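The idea of a learned attention mask that weights present and past time steps differently can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the mask is parameterised, so the code below assumes one learnable logit per temporal lag, a single attention head, and plain NumPy with no training loop. The names `masked_temporal_attention` and `lag_logits` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; rows always contain at least one finite entry.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_temporal_attention(x, w_q, w_k, w_v, lag_logits):
    """Single-head self-attention over T time steps with a soft temporal mask.

    x          : (T, d) sequence of past observations (features per time step)
    w_q/w_k/w_v: (d, d) projection matrices
    lag_logits : (T,) parameters that would be learned (ASSUMPTION); entry k
                 controls how strongly a step may attend to the step k frames
                 earlier.
    """
    T, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (T, T) raw attention scores

    idx = np.arange(T)
    lag = idx[:, None] - idx[None, :]                 # lag[i, j] = i - j
    causal = lag >= 0                                 # never attend to the future

    # Soft mask in (0, 1): sigmoid of the logit for each lag.
    soft_mask = 1.0 / (1.0 + np.exp(-lag_logits[np.clip(lag, 0, T - 1)]))

    # Adding log(mask) to the scores multiplies the unnormalised weights by the mask.
    bias = np.where(causal, np.log(soft_mask), -np.inf)
    weights = softmax(scores + bias, axis=-1)
    return weights @ v, weights


# Usage with random data: 5 time steps, 4 features.
rng = np.random.default_rng(0)
T, d = 5, 4
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
lag_logits = np.zeros(T)                              # neutral mask: all lags treated equally
out, weights = masked_temporal_attention(x, w_q, w_k, w_v, lag_logits)
```

In this sketch, lowering `lag_logits[k]` suppresses attention to observations `k` steps in the past, while raising it lets older frames contribute more. In a trainable model these logits would be optimised jointly with the projection weights.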