By exploiting spatial correspondence, dense self-supervised representation learning has achieved superior performance on various dense prediction tasks. However, pixel-level correspondence tends to be noisy because many pixels, e.g., those in the background, are similar to one another and therefore misleading. To address this issue, we propose to explore \textbf{set} \textbf{sim}ilarity (SetSim) for dense self-supervised representation learning. We generalize pixel-wise similarity learning to set-wise similarity learning, which improves robustness because sets carry more semantic and structural information than individual pixels. Specifically, we use the attentional features of each view to establish corresponding sets, thereby filtering out noisy backgrounds that may cause incorrect correspondences. These attentional features also preserve the coherence of the same image across different views, which alleviates semantic inconsistency. We further search for the cross-view nearest neighbours of the sets and use this structured neighbourhood information to enhance robustness. Empirical evaluations demonstrate that SetSim is superior to state-of-the-art methods on object detection, keypoint detection, instance segmentation, and semantic segmentation.
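The set-wise idea described in the abstract can be illustrated with a small sketch. The abstract does not say how attention is computed, how sets are formed, or how set similarity is defined, so everything below is an assumption made for illustration: attention is taken to be the min-max normalised squared activation norm of each pixel, a set is formed by thresholding that attention, and set similarity is the mean cosine similarity between each selected pixel and its cross-view nearest neighbour. The function names (`attention_map`, `select_set`, `set_similarity`) and the `threshold` parameter are hypothetical and are not taken from the paper.

```python
import math

def attention_map(features):
    """Per-pixel attention, assumed here to be the squared activation norm,
    min-max normalised to [0, 1]. `features` is a list of per-pixel vectors."""
    raw = [sum(c * c for c in px) for px in features]
    lo, hi = min(raw), max(raw)
    span = hi - lo
    if span == 0:
        return [1.0] * len(raw)  # uniform attention: keep every pixel
    return [(r - lo) / span for r in raw]

def select_set(features, threshold=0.5):
    """Indices of pixels whose (assumed) attention reaches the threshold.
    Low-attention pixels, e.g. background, are filtered out."""
    att = attention_map(features)
    return [i for i, a in enumerate(att) if a >= threshold]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def set_similarity(feats_a, feats_b, threshold=0.5):
    """Mean, over pixels in view A's set, of the cosine similarity to the
    nearest neighbour in view B's set (an illustrative definition only)."""
    set_a = select_set(feats_a, threshold)
    set_b = select_set(feats_b, threshold)
    sims = [max(cosine(feats_a[i], feats_b[j]) for j in set_b) for i in set_a]
    return sum(sims) / len(sims)

# Toy views: three pixels with two channels each; one weak "background"
# pixel per view, at a different position in each view.
view_a = [[0.1, 0.0], [2.0, 0.0], [0.0, 2.0]]
view_b = [[0.0, 3.0], [0.1, 0.1], [3.0, 0.0]]

print(select_set(view_a))              # indices kept in view A
print(select_set(view_b))              # indices kept in view B
print(set_similarity(view_a, view_b))  # similarity between the two sets
```

In this toy example the weak pixel in each view falls below the threshold, so it never takes part in the matching, which is the filtering effect the abstract attributes to attentional features. A pixel-wise scheme would instead have to match every pixel, including the background ones.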