Semi-supervised learning for classification assumes that every ground-truth class is represented in the small labelled dataset. Many real-world sparsely labelled datasets plausibly violate this assumption: some classes may appear only in the unlabelled data, perhaps because the labelling process was biased, leaving no labelled examples to train on for those classes. We call this learning regime $\textit{semi-unsupervised learning}$, an extreme case of semi-supervised learning in which some classes have no labelled exemplars in the training set. First, we outline the pitfalls of applying deep generative model (DGM)-based semi-supervised learning algorithms to datasets of this type. We then show how a combination of clustering and semi-supervised learning, using DGMs, can be brought to bear on this problem. We study several different datasets, showing that one can still learn effectively when half of the ground-truth classes are entirely unlabelled and the other half are sparsely labelled.
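To make the data regime concrete, the following is a minimal sketch of how a semi-unsupervised split could be constructed from a fully labelled dataset: half of the classes keep a handful of labels, and the other half keep none. The function name `make_semi_unsupervised_mask`, the toy dataset sizes, and the use of `-1` to mark unlabelled examples are all assumptions made for illustration; they are not taken from the experiments described here.

```python
import numpy as np


def make_semi_unsupervised_mask(labels, labelled_classes, n_per_class, seed=0):
    """Return a boolean mask marking which examples keep their label.

    Classes outside `labelled_classes` keep no labels at all; each class in
    `labelled_classes` keeps at most `n_per_class` labelled examples.
    (Hypothetical helper, written only to illustrate the regime.)
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    mask = np.zeros(labels.shape[0], dtype=bool)
    for c in labelled_classes:
        idx = np.flatnonzero(labels == c)
        keep = rng.choice(idx, size=min(n_per_class, idx.size), replace=False)
        mask[keep] = True
    return mask


# Toy ground truth: 10 classes with 100 examples each (assumed sizes).
y = np.repeat(np.arange(10), 100)

# Classes 0-4 are sparsely labelled; classes 5-9 are entirely unlabelled.
mask = make_semi_unsupervised_mask(y, labelled_classes=range(5), n_per_class=10)

# Labels visible to the learner; -1 marks "unlabelled" (an assumed convention).
y_observed = np.where(mask, y, -1)
```

A learner in this regime sees `y_observed` rather than `y`: it must classify the sparsely labelled classes and, at the same time, discover the classes that never appear among the labels.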