Recent years have witnessed a trend of applying context frames to boost the performance of object detection, known as video object detection. Existing methods usually aggregate features at one stroke to enhance the feature of the target frame. These methods, however, usually lack spatial information from neighboring frames and suffer from insufficient feature aggregation. To address these issues, we adopt a progressive scheme that introduces both temporal and spatial information for an integrated enhancement. The temporal information is introduced by a Temporal Feature Aggregation Model (TFAM), which conducts an attention mechanism between the context frames and the target frame (i.e., the frame to be detected). Meanwhile, we employ a Spatial Transition Awareness Model (STAM) to convey the location transition information between each context frame and the target frame. Built upon the transformer-based detector DETR, our PTSEFormer also follows an end-to-end fashion to avoid heavy post-processing procedures, while achieving 88.1% mAP on the ImageNet VID dataset. Code is available at https://github.com/Hon-Wong/PTSEFormer.
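To make the temporal aggregation step concrete, below is a minimal sketch of how a TFAM-style module could look: target-frame features query context-frame features through standard multi-head cross-attention, with a residual connection back to the target features. The module structure, dimensions, and residual design here are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

class TFAMSketch(nn.Module):
    """Illustrative sketch of temporal feature aggregation:
    the target frame queries context frames via cross-attention.
    All hyperparameters below are assumptions, not the paper's values."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_feat, context_feat):
        # target_feat:  (B, N_t, C) tokens from the frame to be detected
        # context_feat: (B, N_c, C) tokens gathered from context frames
        aggregated, _ = self.attn(query=target_feat,
                                  key=context_feat,
                                  value=context_feat)
        # Residual connection preserves the target frame's own features.
        return self.norm(target_feat + aggregated)

# Usage: enhance target features with tokens from two context frames.
B, N, C = 2, 100, 256
target = torch.randn(B, N, C)
context = torch.cat([torch.randn(B, N, C), torch.randn(B, N, C)], dim=1)
enhanced = TFAMSketch()(target, context)  # (B, N, C)
```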