In this paper, we propose a self-supervised visual representation learning approach that involves both generative and discriminative proxies, where we focus on the former by requiring the target network to recover the original image from mid-level features. Different from prior work that mostly focuses on pixel-level similarity between the original and generated images, we advocate for Semantic-aware Generation (SaGe), which encourages richer semantics, rather than low-level details, to be preserved in the generated image. The core idea behind SaGe is to use an evaluator, a deep network pre-trained without labels, to extract semantic-aware features. SaGe complements the target network with view-specific features and thus alleviates the semantic degradation caused by intensive data augmentation. We execute SaGe on ImageNet-1K and evaluate the pre-trained models on five downstream tasks, including nearest-neighbor retrieval, linear classification, and fine-grained image recognition, demonstrating its ability to learn stronger visual representations.
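The key difference from pixel-level objectives can be illustrated with a minimal sketch: rather than an MSE between images, the loss compares features produced by a frozen, label-free pre-trained evaluator. The tiny `Evaluator` module and the cosine-distance loss below are illustrative assumptions, not the paper's actual architecture or loss.

```python
import torch
import torch.nn as nn

class Evaluator(nn.Module):
    """Stand-in for a self-supervised pre-trained feature extractor.
    Hypothetical architecture; any frozen backbone could play this role."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)

def semantic_aware_loss(evaluator, original, generated):
    """Cosine distance between evaluator features of the original and
    generated images, in place of pixel-level similarity."""
    with torch.no_grad():              # the evaluator target is frozen
        target = evaluator(original)
    pred = evaluator(generated)        # gradients flow back to the generator
    return 1 - nn.functional.cosine_similarity(pred, target, dim=1).mean()

evaluator = Evaluator().eval()
original = torch.rand(4, 3, 32, 32)
generated = torch.rand(4, 3, 32, 32, requires_grad=True)
loss = semantic_aware_loss(evaluator, original, generated)
loss.backward()  # the generated image (i.e., the generator) receives gradients
```

Two images that differ in pixels but share evaluator features incur near-zero loss, which is exactly the property that lets semantics, rather than details, drive the reconstruction.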