Video object detection is a challenging task due to the degraded quality of video sequences captured in complex environments. Currently, this area is dominated by a series of feature-enhancement-based methods, which distill beneficial semantic information from multiple frames and generate enhanced features by fusing the distilled information. However, the distillation and fusion operations are usually performed at either the frame level or the instance level, with external guidance from additional information such as optical flow or feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the information distortion caused by noise. As a result, the proposed DSFNet can generate more robust features through multi-granularity fusion while avoiding the instability of external guidance. To evaluate the proposed DSFNet, we conduct extensive experiments on the ImageNet VID dataset. Notably, the proposed dual semantic fusion network achieves, to the best of our knowledge, the best performance among current state-of-the-art video object detectors: 84.1\% mAP with ResNet-101 and 85.4\% mAP with ResNeXt-101, without using any post-processing steps.
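To make the idea of semantic fusion across frames concrete, the following is a minimal, hypothetical sketch of similarity-weighted feature aggregation, the general pattern underlying feature-enhancement-based detectors: support-frame features are weighted by their similarity to the target frame and summed. The function name, shapes, and the use of cosine similarity with a softmax are illustrative assumptions, not the paper's actual DSFNet formulation.

```python
import numpy as np

def fuse_features(target, supports, tau=1.0):
    """Similarity-weighted fusion of support-frame features into a
    target-frame feature (illustrative sketch, not the DSFNet method).

    target:   (d,)  feature vector of the target frame
    supports: (n, d) feature vectors from n support frames
    tau:      softmax temperature (hypothetical parameter)
    """
    # Cosine similarity between the target and each support feature.
    t = target / np.linalg.norm(target)
    s = supports / np.linalg.norm(supports, axis=1, keepdims=True)
    sim = s @ t                       # shape (n,)

    # Softmax over support frames: more similar frames contribute more.
    w = np.exp(sim / tau)
    w /= w.sum()

    # Weighted sum of support features, shape (d,).
    return w @ supports
```

In this pattern, the weighting suppresses low-quality frames (motion blur, occlusion) whose features diverge from the target, which is the intuition behind fusing distilled information from multiple frames.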