Streaming perception is a critical task in autonomous driving that requires balancing the latency and accuracy of the driving system. However, current streaming perception methods are limited: they rely on only two adjacent frames (the current frame and the previous one) to learn movement patterns, which restricts their ability to model complex scenes and often leads to poor detection results. To address this limitation, we propose LongShortNet, a novel dual-path network that captures long-term temporal motion and integrates it with short-term spatial semantics for real-time perception. LongShortNet is, to our knowledge, the first work to extend long-term temporal modeling to streaming perception, enabling spatiotemporal feature fusion. We evaluate LongShortNet on the challenging Argoverse-HD dataset and show that it outperforms existing state-of-the-art methods at almost no additional computational cost.
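To give a concrete picture of the dual-path idea described above, the sketch below shows one plausible way a short-term spatial feature (from the current frame) could be fused with a buffer of long-term temporal features (from past frames). This is a minimal illustration under assumed names and design choices (the class `DualPathFusion`, the arguments `short_feat` and `long_feats`, and a concatenate-then-1x1-convolution fusion are all hypothetical), not the paper's actual implementation.

```python
# Minimal sketch of dual-path (short-term spatial + long-term temporal) fusion.
# Assumption: a PyTorch-style module; names and the fusion scheme are illustrative.
import torch
import torch.nn as nn


class DualPathFusion(nn.Module):
    """Fuses the current frame's spatial features with buffered past-frame
    features by channel-wise concatenation followed by a 1x1 convolution."""

    def __init__(self, channels: int, num_past_frames: int = 3):
        super().__init__()
        # Project the concatenated short + long features back to `channels`.
        self.fuse = nn.Conv2d(channels * (1 + num_past_frames), channels, kernel_size=1)

    def forward(self, short_feat: torch.Tensor, long_feats: list) -> torch.Tensor:
        # short_feat: (B, C, H, W) features of the current frame (spatial semantics).
        # long_feats: list of (B, C, H, W) features from past frames (temporal motion cues).
        fused = torch.cat([short_feat, *long_feats], dim=1)
        return self.fuse(fused)


if __name__ == "__main__":
    # Usage: fuse the current frame with three buffered past-frame feature maps.
    m = DualPathFusion(channels=256, num_past_frames=3)
    cur = torch.randn(1, 256, 32, 32)
    past = [torch.randn(1, 256, 32, 32) for _ in range(3)]
    out = m(cur, past)
    print(out.shape)  # torch.Size([1, 256, 32, 32])
```

Because the past-frame features can be cached from earlier inference steps, a fusion of this kind adds only a lightweight concatenation and projection at each step, which is consistent with the abstract's claim of almost no additional computational cost.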