Semi-supervised learning (SSL) constructs classifiers from datasets in which only a subset of the observations is labelled, a situation that arises naturally because obtaining labels often requires expert judgement or costly manual effort. This motivates methods that integrate labelled and unlabelled data within a single learning framework. Most SSL approaches assume that the absence of labels is harmless, either treating the labels as missing completely at random (MCAR) or ignoring the missingness mechanism altogether. In practice, however, the missingness process can be informative, since the probability that an observation is unlabelled may depend on the ambiguity of its feature vector. In such cases, the missingness indicators themselves carry additional information that, if properly modelled, may improve estimation efficiency. The \textbf{SSLfmm} package for R is designed to capture this behaviour by estimating the Bayes' classifier under a finite mixture model in which each component corresponds to a class and follows a multivariate normal distribution. It incorporates a mixed-missingness mechanism that combines an MCAR component with a (non-ignorable) missing at random (MAR) component, the latter modelling the probability of label missingness as a logistic function of the entropy based on the features. Parameters are estimated via an Expectation--Conditional Maximisation (ECM) algorithm. In the two-class Gaussian setting with arbitrary covariance matrices, the resulting classifier trained on partially labelled data may, in some cases, achieve a lower misclassification rate than its supervised counterpart trained with all labels known. The package provides a practical tool for modelling, and its performance is illustrated through simulated examples.
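The entropy-based MAR component described above can be illustrated with a minimal sketch. This is not the \textbf{SSLfmm} API and is not taken from the package; it is written in Python rather than R, and it assumes a univariate two-class Gaussian mixture for brevity. The function names and the logistic coefficients `xi0` and `xi1` are hypothetical, and using the raw entropy (rather than some transform of it) as the logistic covariate is also an assumption.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_probs(x, weights, means, sds):
    """Posterior class probabilities under a Gaussian mixture (Bayes' rule)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def missing_prob(x, weights, means, sds, xi0, xi1):
    """Hypothetical MAR component: logistic function of the posterior entropy.

    xi0 and xi1 are illustrative coefficients, not package parameters.
    """
    e = entropy(posterior_probs(x, weights, means, sds))
    return 1.0 / (1.0 + math.exp(-(xi0 + xi1 * e)))

# Illustrative mixture: two equally weighted classes centred at -1 and +1.
weights, means, sds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

# A point on the decision boundary is maximally ambiguous (entropy = log 2),
# while a point far from the boundary has entropy close to zero.
p_boundary = missing_prob(0.0, weights, means, sds, xi0=-3.0, xi1=6.0)
p_far = missing_prob(4.0, weights, means, sds, xi0=-3.0, xi1=6.0)
print(p_boundary > p_far)
```

With a positive slope `xi1`, observations near the decision boundary are more likely to be unlabelled than observations far from it, which is the sense in which the missingness indicators carry information about the mixture parameters.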