Predicting pedestrian behavior is a crucial task for intelligent driving systems. Accurate predictions require a deep understanding of the contextual elements that can influence how pedestrians behave. To address this challenge, we propose a novel framework that relies on multiple data modalities to predict the future trajectories and crossing actions of pedestrians from an ego-centric perspective. Specifically, our model uses a cross-modal Transformer architecture to capture dependencies between different data types. The output of the Transformer is augmented with representations of interactions between pedestrians and other traffic agents; these representations are generated by a semantic attentive interaction module and conditioned on pedestrian and ego-vehicle dynamics. Lastly, the context encodings are fed into a multi-stream decoder framework using a gated-shared network. We evaluate our algorithm on two public pedestrian behavior benchmarks, PIE and JAAD, and show that our model improves on the state of the art in trajectory and action prediction by up to 22% and 13%, respectively, on various metrics. We investigate the contribution of each component of our model through extensive ablation studies.
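As a rough illustration of what capturing dependencies between data types can look like, the sketch below implements generic single-head scaled dot-product cross-attention, in which one modality supplies the queries and another supplies the keys and values. It is not the architecture evaluated in the paper: the modality names (`boxes`, `ego`), the dimensions, the random projection weights, and the helper functions are all assumptions chosen for illustration, and multi-head projections, positional encodings, and training are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Attend from one modality (queries) to another (keys and values)."""
    q = matmul(query_seq, w_q)      # (T_q, d_model)
    k = matmul(context_seq, w_k)    # (T_c, d_model)
    v = matmul(context_seq, w_v)    # (T_c, d_model)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or observed features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D_BOX, D_EGO, D_MODEL = 15, 4, 2, 8   # assumed sizes, not from the paper

boxes = random_matrix(T, D_BOX, rng)     # e.g. bounding-box coordinates per frame
ego = random_matrix(T, D_EGO, rng)       # e.g. ego-vehicle speed and yaw rate per frame

fused, weights = cross_modal_attention(
    boxes,
    ego,
    random_matrix(D_BOX, D_MODEL, rng),
    random_matrix(D_EGO, D_MODEL, rng),
    random_matrix(D_EGO, D_MODEL, rng),
)
```

Here each time step of the pedestrian stream produces one fused vector of size `D_MODEL`, computed as a weighted average of the projected ego-vehicle features. Swapping the roles of the two inputs gives attention in the opposite direction.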