Generalized category discovery (GCD) is a problem setting where the goal is to discover novel categories within an unlabelled dataset using knowledge learned from a set of labelled samples. Recent work on GCD argues that a non-parametric classifier built with semi-supervised $k$-means can outperform strong baselines that use parametric classifiers, because it alleviates over-fitting to the seen categories in the labelled set. In this paper, we revisit why previous parametric classifiers fail to recognise new classes in GCD. By investigating the design choices of parametric classifiers from the perspectives of model architecture, representation learning, and classifier learning, we conclude that less discriminative representations and an unreliable pseudo-labelling strategy are the key factors that make parametric classifiers lag behind non-parametric ones. Motivated by this investigation, we present a simple yet effective parametric classification baseline that outperforms the previous best methods by a large margin on multiple popular GCD benchmarks. We hope the investigation and the simple baseline can serve as a cornerstone to facilitate future studies. Our code is available at: https://github.com/CVMI-Lab/SimGCD.