Semi-supervised learning is widely used to estimate classifiers from training data in which not all of the feature vectors have labels. With a generative model that specifies a form for the joint distribution of a feature vector and its ground-truth label, the Bayes' classifier can be estimated by maximum likelihood from partially classified training data. To increase the accuracy of this sample classifier, \cite{ahfock2020apparent} proposed adopting a missing-label mechanism and estimating the Bayes' classifier from the full likelihood, formed in a framework that models the probability of a missing label given the observed feature vector in terms of its entropy. In the case of two Gaussian classes with a common covariance matrix, it was shown that the classifier so estimated from the partially classified training data can even have a lower error rate than if it were estimated from the completely classified sample. Here, we focus on an algorithm for estimating the Bayes' classifier via the full likelihood in the case of multiple Gaussian classes with arbitrary covariance matrices. Different strategies for initializing the algorithm are discussed and illustrated. A new \proglang{R} package implementing these tools, \texttt{gmmsslm}, is demonstrated on real data.
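To make the ingredients named above concrete, the following is a minimal Python sketch, not the \texttt{gmmsslm} implementation and not its API. It computes the posterior class probabilities under a Gaussian mixture with class-specific covariance matrices, the resulting Bayes' rule, the entropy of each posterior, and a missing-label probability that depends on that entropy. The logistic form in the log entropy and the parameter names `xi0` and `xi1` are assumptions made for illustration; the text above states only that the probability of a missing label is modelled in terms of the entropy. All parameter values are made up.

```python
import numpy as np


def gaussian_logpdf(y, mean, cov):
    """Log density of a multivariate normal evaluated at each row of y."""
    d = mean.shape[0]
    chol = np.linalg.cholesky(cov)                 # cov = chol @ chol.T
    z = np.linalg.solve(chol, (y - mean).T)        # whitened residuals, shape (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=0))


def posterior_probs(y, pis, means, covs):
    """Posterior class-membership probabilities, one row per feature vector."""
    log_w = np.stack(
        [np.log(p) + gaussian_logpdf(y, m, c) for p, m, c in zip(pis, means, covs)],
        axis=1,
    )
    log_w -= log_w.max(axis=1, keepdims=True)      # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)


def entropy(tau):
    """Shannon entropy of each row of posterior probabilities."""
    safe = np.clip(tau, 1e-300, 1.0)               # avoids log(0); 0 * log(.) stays 0
    return -np.sum(tau * np.log(safe), axis=1)


def missing_label_prob(tau, xi0, xi1):
    """ASSUMED mechanism: logistic function of xi0 + xi1 * log(entropy)."""
    log_e = np.log(np.maximum(entropy(tau), 1e-300))
    x = xi0 + xi1 * log_e
    return 0.5 * (1.0 + np.tanh(0.5 * x))          # numerically stable logistic


# Hypothetical three-class, two-dimensional model with unequal covariances.
pis = np.array([0.3, 0.3, 0.4])
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
covs = [
    np.eye(2),
    np.array([[1.0, 0.5], [0.5, 1.0]]),
    np.array([[2.0, 0.0], [0.0, 0.5]]),
]

# A point at the first class mean, one between two classes, one at the second mean.
y = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])

tau = posterior_probs(y, pis, means, covs)
labels = tau.argmax(axis=1)                        # Bayes' rule: most probable class
q = missing_label_prob(tau, xi0=1.0, xi1=2.0)      # illustrative values, xi1 > 0
```

With `xi1 > 0` in this assumed form, a feature vector lying between classes (high entropy) is more likely to have its label missing than one deep inside a class, so the pattern of missing labels itself carries information about the class boundaries.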