Unsupervised learning, and clustering in particular, is of limited use without expertise in the field. Researchers must make careful, informed decisions about which algorithm and which hyperparameters to use for a given dataset. They may also need to determine the number of clusters in the dataset, which is unfortunately itself an input to most clustering algorithms. All of this must happen before they can begin their actual subject-matter work. After quantifying the impact of algorithm and hyperparameter selection, we propose an ensemble clustering framework that requires minimal input. It can be used to determine both the number of clusters in a dataset and a suitable algorithm for that dataset. A code library is included in the Conclusion for ease of integration.