Learning self-supervised image representations has been broadly studied to boost various visual understanding tasks. Existing methods typically learn a single level of image semantics, such as pairwise semantic similarity or image clustering patterns. However, these methods can hardly capture the multiple levels of semantic information that naturally exist in an image dataset, e.g., the semantic hierarchy "Persian cat to cat to mammal" encoded in an image database of species. It is thus unknown whether an arbitrary image self-supervised learning (SSL) approach can benefit from learning such hierarchical semantics. To answer this question, we propose a general framework for Hierarchical Image Representation Learning (HIRL). This framework aims to learn multiple semantic representations for each image, structured to encode image semantics from fine-grained to coarse-grained. Based on a probabilistic factorization, HIRL learns the most fine-grained semantics with an off-the-shelf image SSL approach and learns multiple coarse-grained semantics with a novel semantic path discrimination scheme. We adopt six representative image SSL methods as baselines and study how they perform under HIRL. Under rigorous and fair comparison, performance gains are observed for all six methods on diverse downstream tasks, which, for the first time, verifies the general effectiveness of learning hierarchical image semantics. All source code and model weights are available at https://github.com/hirl-team/HIRL