Making accurate predictions of occupancy and flow is essential to enable better safety and interaction for autonomous vehicles in complex traffic scenarios. This work proposes STrajNet: a multi-modal Swin Transformer-based framework for effective scene occupancy and flow prediction. We employ Swin Transformer to encode the image and interaction-aware motion representations, and propose a cross-attention module to inject motion awareness into grid cells across different time steps. Flow and occupancy predictions are then decoded through temporal-sharing Pyramid decoders. The proposed method achieves competitive prediction accuracy and other evaluation metrics on the Waymo Open Dataset benchmark.
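To make the cross-attention idea concrete, the sketch below shows one plausible way grid-cell tokens could attend to per-agent motion embeddings so that scene features become motion-aware. This is not the authors' implementation: the module name, tensor shapes, single-head attention, and projection dimensions are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of cross-attention that injects
# motion awareness into occupancy-grid tokens: queries come from grid-cell
# features, keys/values come from interaction-aware motion embeddings.
import torch
import torch.nn as nn


class MotionAwareCrossAttention(nn.Module):
    def __init__(self, grid_dim: int, motion_dim: int, attn_dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(grid_dim, attn_dim)    # queries from grid cells
        self.k_proj = nn.Linear(motion_dim, attn_dim)  # keys from motion tokens
        self.v_proj = nn.Linear(motion_dim, grid_dim)  # values projected back to grid_dim
        self.norm = nn.LayerNorm(grid_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, grid_feats: torch.Tensor, motion_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats:   (B, H*W, grid_dim)       flattened scene/grid tokens
        # motion_feats: (B, N_agents, motion_dim) interaction-aware motion embeddings
        q = self.q_proj(grid_feats)
        k = self.k_proj(motion_feats)
        v = self.v_proj(motion_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Residual connection keeps the original scene features and adds motion context.
        return self.norm(grid_feats + attn @ v)


if __name__ == "__main__":
    B, H, W, N = 2, 32, 32, 16
    layer = MotionAwareCrossAttention(grid_dim=96, motion_dim=128)
    grid = torch.randn(B, H * W, 96)      # hypothetical Swin-encoded grid features
    motion = torch.randn(B, N, 128)       # hypothetical per-agent motion features
    out = layer(grid, motion)
    print(out.shape)  # torch.Size([2, 1024, 96])
```

In practice this block would be applied at each predicted time step (or with temporally shared weights) before the pyramid decoders; those details are only hinted at in the abstract and are not specified here.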