Hierarchical semantic structures naturally exist in image datasets: several semantically related image clusters can be merged into a larger cluster with coarser-grained semantics. Capturing such structures in image representations can greatly benefit semantic understanding in various downstream tasks. Existing contrastive representation learning methods lack this capability. In addition, the negative pairs used in these methods are not guaranteed to be semantically distinct, which can further harm the structural correctness of the learned image representations. To address these limitations, we propose a novel contrastive learning framework called Hierarchical Contrastive Selective Coding (HCSC). In this framework, a set of hierarchical prototypes is constructed and dynamically updated to represent the hierarchical semantic structures underlying the data in the latent space. To make image representations better fit these structures, we employ and further improve conventional instance-wise and prototypical contrastive learning with an elaborate pair selection scheme, which seeks to select more diverse positive pairs with similar semantics and more precise negative pairs with truly distinct semantics. Across a wide range of downstream tasks, we verify the superior performance of HCSC over state-of-the-art contrastive methods, and analytical studies demonstrate the effectiveness of the major model components. Our source code and model weights are available at https://github.com/gyfastas/HCSC
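To make the two ideas in the abstract more concrete, the following is a minimal sketch, not the HCSC implementation. It assumes that hierarchical prototypes can be approximated by running k-means bottom-up (first on the embeddings, then on the resulting centroids), and it replaces the paper's pair selection scheme with a much simpler deterministic filter that drops negative candidates assigned to the same cluster as the query. The function names `kmeans`, `hierarchical_prototypes`, and `select_negatives`, the level sizes, and the random embeddings are all hypothetical.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies the rows, so updating centroids does not touch x.
    centroids = x[rng.choice(len(x), size=k, replace=False)]

    def nearest(c):
        # Squared Euclidean distance from every point to every centroid.
        d = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)
        return d.argmin(1)

    for _ in range(iters):
        assign = nearest(centroids)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(0)
    return centroids, nearest(centroids)


def hierarchical_prototypes(z, sizes=(8, 2)):
    """Bottom-up hierarchy: cluster embeddings, then cluster the prototypes.

    Returns one (prototypes, per-sample assignment) pair per level.
    This is an assumed simplification, not the authors' procedure.
    """
    levels, points, sample_assign = [], z, None
    for k in sizes:
        protos, assign = kmeans(points, k)
        # Compose assignments so every original sample maps to this level.
        sample_assign = assign if sample_assign is None else assign[sample_assign]
        levels.append((protos, sample_assign))
        points = protos
    return levels


def select_negatives(query, candidates, sample_assign):
    """Keep only candidates whose cluster differs from the query's cluster."""
    candidates = np.asarray(candidates)
    return candidates[sample_assign[candidates] != sample_assign[query]]


# Toy usage on random unit-norm embeddings (placeholder data, not real features).
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)

levels = hierarchical_prototypes(z, sizes=(8, 2))
negs = select_negatives(0, np.arange(1, 200), levels[0][1])
```

In this sketch, filtering at a coarser level (for example `levels[1][1]`) removes more candidates, since a candidate only survives if it falls under a different coarse-grained prototype than the query. A probabilistic selection rule would be a natural refinement, but its exact form is not given in the abstract and is left out here.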