Semi-supervised learning (SSL) constructs classifiers from datasets in which only a subset of the observations is labelled, a situation that arises naturally because obtaining labels often requires expert judgement or costly manual effort. This motivates methods that integrate labelled and unlabelled data within a single learning framework. Most SSL approaches assume that the absence of a label is harmless, either treating labels as missing completely at random or ignoring the missingness process altogether. In practice, however, the missingness process can be informative, since the probability that an observation is unlabelled may depend on the ambiguity of its feature vector. In such cases, the missingness indicators themselves carry additional information that, if properly modelled, may improve estimation efficiency.

The \textbf{SSLfmm} package for R is designed to capture this behaviour by estimating the Bayes' classifier under a finite mixture model in which each component corresponds to a class and follows a multivariate normal distribution. It incorporates a mixed-missingness mechanism that combines a missing completely at random (MCAR) component with a (non-ignorable) missing at random (MAR) component, the latter modelling the probability that a label is missing as a logistic function of the entropy computed from the features. Parameters are estimated via an Expectation--Conditional Maximisation algorithm. In the two-class Gaussian setting with arbitrary covariance matrices, the resulting classifier trained on partially labelled data may, in some cases, achieve a lower misclassification rate than its fully supervised counterpart trained with all labels known. The package provides a practical tool for modelling, and its performance is illustrated through simulated examples.
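To make the entropy-based missingness mechanism concrete, the following is a minimal Python sketch; it is not the \textbf{SSLfmm} API. It computes class posterior probabilities under a two-component Gaussian mixture, takes their Shannon entropy, and passes the entropy through a logistic function to obtain a probability that the label is missing. The function names, the parameters `alpha`, `xi0` and `xi1`, and the particular way the MCAR and entropy-based components are combined are assumptions made for illustration only; the package may, for instance, use a transform of the entropy or a different parameterisation.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Row-wise log density of a multivariate normal N(mean, cov)."""
    d = mean.shape[0]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (x - mean).T)  # whitened residuals, shape (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=0))


def posterior_probs(x, pis, means, covs):
    """Posterior class probabilities under the Gaussian mixture."""
    log_joint = np.stack(
        [np.log(p) + gaussian_logpdf(x, m, c) for p, m, c in zip(pis, means, covs)],
        axis=1,
    )
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(log_joint)
    return w / w.sum(axis=1, keepdims=True)


def entropy(tau, eps=1e-12):
    """Shannon entropy of each row of posterior probabilities."""
    return -np.sum(tau * np.log(np.clip(tau, eps, 1.0)), axis=1)


def missing_prob(x, pis, means, covs, alpha, xi0, xi1):
    """Assumed mixed mechanism: an MCAR share alpha plus an entropy-driven part."""
    e = entropy(posterior_probs(x, pis, means, covs))
    logistic = 1.0 / (1.0 + np.exp(-(xi0 + xi1 * e)))
    return alpha + (1.0 - alpha) * logistic


# Hypothetical two-class setting: one point on the decision boundary, one far from it.
pis = [0.5, 0.5]
means = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
x = np.array([[0.0, 0.0], [4.0, 0.0]])

tau = posterior_probs(x, pis, means, covs)
q = missing_prob(x, pis, means, covs, alpha=0.1, xi0=-3.0, xi1=6.0)
print("entropy:", entropy(tau))
print("P(label missing):", q)
```

With a positive `xi1`, the point on the decision boundary has maximal entropy and therefore a higher probability of being unlabelled than the point far from the boundary, which is the kind of informative missingness the abstract describes.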