Forecasting the future states of surrounding traffic participants is a crucial capability for autonomous vehicles. Occupancy flow field prediction, a recently proposed approach, offers a scalable and effective representation for jointly predicting the future motions of surrounding agents in a scene. The main challenge, however, is modeling the underlying social interactions among traffic agents and the relations between occupancy and flow. This paper therefore proposes a novel Multi-modal Hierarchical Transformer network that fuses the vectorized (agent motion) and visual (scene flow, map, and occupancy) modalities and jointly predicts the flow and occupancy of the scene. Specifically, visual and vector features from sensory data are encoded by a multi-stage Transformer module and then fused by a late-fusion Transformer module with temporal pixel-wise attention. Importantly, a flow-guided multi-head self-attention (FG-MSA) module is designed to better aggregate the information on occupancy and flow and to model the mathematical relations between them. The proposed method is comprehensively validated on the Waymo Open Motion Dataset and compared against several state-of-the-art models. The results show that our model achieves comparable performance despite using a much more compact architecture and more compact data inputs than other methods. We also demonstrate the effectiveness of incorporating vectorized agent motion features and the proposed FG-MSA module. An ablated version of the model without the FG-MSA module won 2nd place in the 2022 Waymo Occupancy and Flow Prediction Challenge; compared with that version, the current model shows better separability for flow and occupancy and further performance improvements.
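The abstract does not spell out the mathematical relation between occupancy and flow. As a rough illustration only, the sketch below assumes the backward-flow convention commonly used for occupancy flow fields: each grid cell's flow vector points to where its occupant was at the previous step, so the previous occupancy can be warped forward and multiplied with the current occupancy as a consistency check. The function `warp_occupancy`, the grid layout, and the toy values are hypothetical and are not taken from the paper; this is not the FG-MSA module itself.

```python
from typing import List, Tuple

Grid = List[List[float]]
FlowGrid = List[List[Tuple[float, float]]]


def warp_occupancy(prev_occ: Grid, flow: FlowGrid) -> Grid:
    """Backward-warp the previous occupancy grid with a per-cell flow field.

    Assumption: flow[y][x] = (dx, dy) points from cell (x, y) at the current
    step back to the cell its occupant came from at the previous step.
    """
    height, width = len(prev_occ), len(prev_occ[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbour lookup of the source cell; a trainable model
            # would more likely use differentiable bilinear sampling.
            src_x, src_y = int(round(x + dx)), int(round(y + dy))
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = prev_occ[src_y][src_x]
    return warped


# Toy 1x4 scene: the occupant of cell x=1 moves one cell to the right.
prev_occ = [[0.0, 1.0, 0.0, 0.0]]
curr_occ = [[0.0, 0.0, 1.0, 0.0]]
flow = [[(0.0, 0.0), (0.0, 0.0), (-1.0, 0.0), (0.0, 0.0)]]

warped = warp_occupancy(prev_occ, flow)

# Element-wise product keeps only cells that are occupied now AND whose flow
# traces back to a previously occupied cell.
grounded = [
    [w * o for w, o in zip(w_row, o_row)]
    for w_row, o_row in zip(warped, curr_occ)
]
```

In this toy case the cell with zero flow at x=1 still copies its old value during warping, and the product with the current occupancy removes it, leaving only the cell at x=2. This illustrates why occupancy and flow are more informative when predicted jointly than separately.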