Streaming perception is a fundamental task in autonomous driving that requires a careful balance between the latency and accuracy of the driving system. However, current methods for streaming perception are limited in that they rely only on the current frame and its adjacent frame to learn movement patterns, which restricts their ability to model complex scenes and often leads to poor detection results. To address this limitation, we propose LongShortNet, a novel dual-path network that captures long-term temporal motion and integrates it with short-term spatial semantics for real-time perception. To our knowledge, LongShortNet is the first work to extend long-term temporal modeling to streaming perception, enabling spatiotemporal feature fusion. We evaluate LongShortNet on the challenging Argoverse-HD dataset and demonstrate that it outperforms existing state-of-the-art methods with almost no additional computational cost.
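To make the dual-path idea concrete, below is a minimal PyTorch-style sketch of how a long-term temporal path over several past frames could be fused with a short-term spatial path over the current frame. All module names, channel sizes, and the 1x1-convolution fusion are illustrative assumptions, not the paper's exact LongShortNet components.

```python
import torch
import torch.nn as nn


class DualPathFusion(nn.Module):
    """Sketch of a long/short dual-path fusion block (hypothetical names).

    The short path keeps the spatial semantics of the current frame, while
    the long path aggregates backbone features from several past frames to
    capture longer-term motion; the two are fused channel-wise.
    """

    def __init__(self, channels: int = 256, num_past_frames: int = 3):
        super().__init__()
        self.num_past_frames = num_past_frames
        # Short path: lightweight projection of the current-frame features.
        self.short_proj = nn.Conv2d(channels, channels, kernel_size=1)
        # Long path: compress the stacked past-frame features back to `channels`.
        self.long_proj = nn.Conv2d(channels * num_past_frames, channels, kernel_size=1)
        # Fusion: merge the two paths into a single feature map.
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, current_feat: torch.Tensor, past_feats: list) -> torch.Tensor:
        # current_feat: (B, C, H, W) features of the latest frame.
        # past_feats: list of num_past_frames tensors, each (B, C, H, W).
        short = self.short_proj(current_feat)
        long = self.long_proj(torch.cat(past_feats, dim=1))
        return self.fuse(torch.cat([short, long], dim=1))


if __name__ == "__main__":
    # Toy usage: one current frame plus three past frames of backbone features.
    block = DualPathFusion(channels=256, num_past_frames=3)
    cur = torch.randn(1, 256, 20, 32)
    past = [torch.randn(1, 256, 20, 32) for _ in range(3)]
    print(block(cur, past).shape)  # torch.Size([1, 256, 20, 32])
```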