Co-Salient Object Detection (CoSOD) aims to simulate the human visual system to discover the common and salient objects from a group of relevant images. Recent methods typically develop sophisticated deep-learning-based models that have greatly improved performance on the CoSOD task. However, two major drawbacks still need to be addressed: 1) sub-optimal inter-image relationship modeling; and 2) a lack of consideration of inter-image separability. In this paper, we propose the Co-Salient Object Detection Transformer (CoSformer) network to capture both salient and common visual patterns from multiple images. By leveraging the Transformer architecture, the proposed method addresses the influence of input order and greatly improves the stability of the CoSOD task. We also introduce the novel concept of inter-image separability. We construct a contrastive learning scheme to model inter-image separability and learn a more discriminative embedding space that distinguishes true common objects from noisy objects. Extensive experiments on three challenging benchmarks, i.e., CoCA, CoSOD3k, and Cosal2015, demonstrate that our CoSformer outperforms cutting-edge models and achieves a new state-of-the-art. We hope that CoSformer can motivate future research on more visual co-analysis tasks.
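The abstract only states the high-level idea of the contrastive scheme, so the following is a minimal sketch of what an inter-image separability objective could look like, not the paper's actual loss: it assumes an InfoNCE-style formulation in PyTorch, where embeddings of candidate common objects (one per image in the group) serve as mutual positives and embeddings of noisy distractor objects serve as negatives. The function name, tensor shapes, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def inter_image_contrastive_loss(common_emb, noisy_emb, temperature=0.1):
    """Hypothetical InfoNCE-style loss for inter-image separability.

    Pulls common-object embeddings from the same image group together
    and pushes them away from noisy-object embeddings.

    common_emb: (N, D) embeddings of candidate common objects, N >= 2
    noisy_emb:  (M, D) embeddings of noisy (distractor) objects
    """
    common = F.normalize(common_emb, dim=1)
    noisy = F.normalize(noisy_emb, dim=1)

    # Pairwise similarities: among common objects (positives)
    # and between common and noisy objects (negatives).
    pos_sim = common @ common.t() / temperature  # (N, N)
    neg_sim = common @ noisy.t() / temperature   # (N, M)

    # Exclude each anchor's self-similarity from the positive set.
    n = common.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=common.device)
    pos_exp = pos_sim.masked_fill(self_mask, float('-inf')).exp().sum(dim=1)
    neg_exp = neg_sim.exp().sum(dim=1)

    # For each anchor: -log( positives / (positives + negatives) ).
    loss = -(pos_exp / (pos_exp + neg_exp)).log()
    return loss.mean()
```

Under this reading, minimizing the loss makes true common objects cluster across images while noisy objects are repelled, which is one plausible way to realize the "more discriminative embedding space" the abstract describes.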