In novel class discovery (NCD), we are given labeled data from seen classes and unlabeled data from unseen classes, and we train clustering models for the unseen classes. However, the implicit assumptions behind NCD are still unclear. In this paper, we demystify the assumptions behind NCD and find that high-level semantic features should be shared between the seen and unseen classes. Based on this finding, NCD is theoretically solvable under certain assumptions and can be naturally linked to meta-learning, which rests on exactly the same assumption as NCD. Thus, we can empirically solve the NCD problem with meta-learning algorithms after slight modifications. This meta-learning-based methodology significantly reduces the amount of unlabeled data needed for training and makes NCD more practical, as demonstrated in our experiments. The use of very limited data is also justified by the application scenario of NCD: since it is unnatural to label only seen-class data, the causal bottleneck in NCD is sampling rather than labeling. Therefore, unseen-class data should be collected alongside seen-class data, which is why they are novel and must first be clustered.