Predicting the behavior of other agents on the road is critical for safe and efficient autonomous driving. The main challenges are representing the social interactions between agents and producing multiple plausible trajectories in an interpretable way. In this paper, we introduce a neural prediction framework based on the Transformer architecture that models the relationships among interacting agents and extracts the target agent's attention on map waypoints. Specifically, we organize the interacting agents into a graph and use a multi-head attention Transformer encoder to extract the relations between them. To address the multi-modality of motion prediction, we propose a multi-modal attention Transformer encoder, which adapts the multi-head attention mechanism into multi-modal attention so that each predicted trajectory is conditioned on an independent attention mode. The proposed model is validated on the Argoverse motion forecasting dataset and achieves state-of-the-art prediction accuracy while maintaining a small model size and a simple training process. We also demonstrate that the multi-modal attention module can automatically identify different modes of the target agent's attention on the map, which improves the interpretability of the model.
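The idea of conditioning each predicted trajectory on its own attention mode can be illustrated with a small sketch. This is not the model described above: it assumes, purely for illustration, that each mode has its own query, key, and value projections and that the per-mode outputs are kept separate instead of being concatenated as in standard multi-head attention. All names (`multi_modal_attention`, `mode_params`, the dimensions) are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(vec, weight):
    """Multiply a vector of length d_in by a d_in x d_out matrix (list of rows)."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def attend(query, keys, values):
    """Scaled dot-product attention of a single query over keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def multi_modal_attention(target, waypoints, mode_params):
    """Return one (context, weights) pair per mode, without concatenation.

    Assumption: each mode owns its (W_q, W_k, W_v) projections, so each
    mode can place its attention on a different subset of waypoints.
    """
    outputs = []
    for w_q, w_k, w_v in mode_params:
        query = project(target, w_q)
        keys = [project(p, w_k) for p in waypoints]
        values = [project(p, w_v) for p in waypoints]
        outputs.append(attend(query, keys, values))
    return outputs


def random_matrix(rng, rows, cols):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_mode, num_modes, num_waypoints = 8, 4, 3, 5

# Toy embeddings for the target agent and the map waypoints.
target = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
waypoints = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)]
             for _ in range(num_waypoints)]
mode_params = [tuple(random_matrix(rng, d_model, d_mode) for _ in range(3))
               for _ in range(num_modes)]

modes = multi_modal_attention(target, waypoints, mode_params)
for index, (context, weights) in enumerate(modes):
    # Each mode yields its own attention distribution over the waypoints.
    print(index, [round(w, 3) for w in weights])
```

In a full model, each per-mode context vector would feed a decoder that outputs one candidate trajectory, and the per-mode attention weights over the waypoints are what could be inspected for interpretability.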