Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset by leveraging prior knowledge from a labeled set comprising disjoint but related classes. Existing research focuses primarily on how to utilize the labeled set at the methodological level, with less emphasis on analyzing the labeled set itself. In this paper, we therefore rethink novel class discovery from the perspective of the labeled set and focus on two core questions: (i) Given a specific unlabeled set, what kind of labeled set best supports novel class discovery? (ii) A fundamental premise of NCD is that the labeled set must be related to the unlabeled set, but how can we measure this relation? For (i), we propose and substantiate the hypothesis that NCD benefits more from a labeled set with a high degree of semantic similarity to the unlabeled set. Specifically, we establish an extensive, large-scale benchmark on ImageNet with varying degrees of semantic similarity between the labeled and unlabeled sets by leveraging its hierarchical class structure. In sharp contrast, existing NCD benchmarks are built from labeled sets that differ in the number of categories and images, and they completely ignore the semantic relation. For (ii), we introduce a mathematical definition for quantifying the semantic similarity between labeled and unlabeled sets. We use this metric to confirm the validity of our proposed benchmark and demonstrate that it correlates strongly with NCD performance. Furthermore, previous works commonly assume, without quantitative analysis, that label information is always beneficial. Counterintuitively, our experimental results show that using labels may lead to sub-optimal outcomes in low-similarity settings.