The clustering task consists of partitioning the elements of a sample into homogeneous groups. Most datasets contain individuals that are ambiguous and intrinsically difficult to assign to one cluster or another. In practical applications, however, misclassifying an individual can be disastrous and should be avoided. To keep the misclassification rate small, one can decide to classify only part of the sample. In the supervised setting, this approach is well known and is referred to as classification with an abstention option. In this paper, the approach is revisited in an unsupervised mixture model framework, and the purpose is to develop a method that comes with the guarantee that the false clustering rate (FCR) does not exceed a pre-defined nominal level $\alpha$. A new procedure is proposed and shown to be optimal up to a remainder term, in the sense that the FCR is controlled while the number of classified items is maximized. Bootstrap versions of the procedure are shown to improve performance in numerical experiments. An application to breast cancer data illustrates the practical benefits of the new approach.
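To make the idea of clustering with an abstention option concrete, the following is a minimal sketch of a plug-in selection rule. It is an illustration under stated assumptions, not the procedure analyzed in the paper. It assumes that posterior cluster membership probabilities are already available (for instance from a mixture model fitted by EM) and that they are accurate. The function name `select_with_abstention` and the toy numbers are hypothetical. Items are ranked by their estimated probability of being mislabeled, and the largest set whose average estimated error stays at or below $\alpha$ is labeled; the remaining items are left unclassified.

```python
def select_with_abstention(posteriors, alpha):
    """Label as many items as possible while keeping the average
    estimated error of the labeled items at or below alpha.

    posteriors: list of rows, one per item; each row holds the
        (assumed known or estimated) cluster membership probabilities.
    alpha: nominal level for the estimated false clustering rate.

    Returns (selected, labels): indices of the labeled items, in order
    of increasing risk, and a dict mapping each index to its cluster.
    """
    # Estimated probability that the most likely cluster is wrong.
    risks = [1.0 - max(row) for row in posteriors]

    # Visit items from least to most ambiguous.
    order = sorted(range(len(risks)), key=lambda i: risks[i])

    selected = []
    running_sum = 0.0
    for i in order:
        # Risks are sorted, so the running average never decreases:
        # the first violation marks the largest admissible selection.
        if (running_sum + risks[i]) / (len(selected) + 1) > alpha:
            break
        running_sum += risks[i]
        selected.append(i)

    # Assign each selected item to its most probable cluster.
    labels = {
        i: max(range(len(posteriors[i])), key=lambda k: posteriors[i][k])
        for i in selected
    }
    return selected, labels


# Toy posteriors for five items and two clusters (illustrative values).
posteriors = [
    [0.99, 0.01],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.60, 0.40],
    [0.02, 0.98],
]
selected, labels = select_with_abstention(posteriors, alpha=0.05)
```

With these toy values and $\alpha = 0.05$, the three most confidently assigned items are labeled and the two ambiguous ones are left unclassified. In practice the posteriors are themselves estimated, so the guarantee of such a rule is only as good as the fitted model.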