Real-time perception, also called streaming perception, is a crucial aspect of autonomous driving that existing research has yet to explore thoroughly. To address this gap, we present DAMO-StreamNet, an optimized framework that combines recent advances from the YOLO series with a comprehensive analysis of spatial and temporal perception mechanisms. The key innovations of DAMO-StreamNet are: (1) A robust neck structure incorporating deformable convolution, which enlarges the receptive field and improves feature alignment. (2) A dual-branch structure that integrates short-path semantic features and long-path temporal features, improving the accuracy of motion state prediction. (3) Logits-level distillation for efficient optimization, aligning the logits of the teacher and student networks in semantic space. (4) A real-time forecasting mechanism that updates the support frame features with the current frame, ensuring seamless streaming perception during inference. Our experiments demonstrate that DAMO-StreamNet surpasses existing state-of-the-art methods, achieving 37.8% sAP at normal input size (600, 960) and 43.3% sAP at large input size (1200, 1920) without using extra data. This work not only sets a new benchmark for real-time perception but also provides valuable insights for future research. Additionally, DAMO-StreamNet can be applied to other autonomous systems that require real-time perception, such as drones and robots.
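To make innovation (4) concrete, the following is a minimal sketch of a support-frame feature cache, not DAMO-StreamNet's actual implementation. The names `SupportFeatureBuffer`, `run_stream`, and `encode`, the plain-list feature type, and the cold-start behavior (the first frame serves as its own support) are all assumptions made for illustration. The sketch only shows the general idea that each frame is encoded once and its features are then reused as support features for later frames.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Feature = List[float]  # stand-in for a feature map; an assumption for this sketch


class SupportFeatureBuffer:
    """Caches features of recent frames so each frame is encoded only once."""

    def __init__(self, num_support: int) -> None:
        if num_support < 1:
            raise ValueError("num_support must be at least 1")
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._buffer: Deque[Feature] = deque(maxlen=num_support)

    def step(self, current: Feature) -> List[Feature]:
        """Return the support features for `current`, then cache `current`."""
        if self._buffer:
            support = list(self._buffer)  # oldest first
        else:
            # Cold start (assumed behavior): no history yet, so the
            # current frame stands in as its own support frame.
            support = [current]
        # The current frame becomes a support frame for the next step.
        self._buffer.append(current)
        return support


def run_stream(
    frames: Sequence[object],
    encode: Callable[[object], Feature],
    num_support: int = 2,
) -> List[Tuple[Feature, List[Feature]]]:
    """Pair each frame's features with cached support features.

    `encode` is a hypothetical per-frame feature extractor.
    """
    buffer = SupportFeatureBuffer(num_support)
    outputs = []
    for frame in frames:
        current = encode(frame)  # encoded exactly once per frame
        support = buffer.step(current)
        outputs.append((current, support))
    return outputs
```

In a real detector, the `(current, support)` pair would be passed to the temporal fusion and forecasting head. Because support features come from the cache, the per-frame cost during inference stays close to that of encoding a single frame.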