There is no, nor will there ever be, a single best clustering algorithm, but we would still like to be able to pinpoint the methods that perform well on certain task types and to filter out the systematically disappointing ones. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. Yet, their validity is questionable, because the clusterings they promote can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs against reference, ground-truth groupings provided by experts. The commonly used classical partition similarity scores, such as the normalised mutual information, the Fowlkes-Mallows index, or the adjusted Rand index, might not possess all the desirable properties, e.g., they do not identify pathological edge cases correctly. Furthermore, they are not easily interpretable: it is hard to say what a score of, say, 0.8 really means. Their behaviour may also vary as the number of true clusters changes. This makes comparing clustering algorithms across many benchmark datasets difficult. To remedy these problems, we propose and analyse a new measure: an asymmetric version of the optimal set-matching accuracy. It is corrected for chance and the imbalance of cluster sizes.
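As a rough illustration of the kind of measure described above, the following minimal Python sketch computes an asymmetric, chance-corrected optimal set-matching accuracy. The function name, the row-wise normalisation by reference cluster sizes, and the exact chance-correction term are assumptions made for this example only; they should not be read as the definition or reference implementation proposed in the paper.

import numpy as np
from scipy.optimize import linear_sum_assignment

def normalised_set_matching_accuracy(y_true, y_pred):
    # y_true: reference (ground-truth) labels; y_pred: predicted cluster labels
    y_true = np.unique(y_true, return_inverse=True)[1]  # re-encode as 0..k-1
    y_pred = np.unique(y_pred, return_inverse=True)[1]  # re-encode as 0..l-1
    k = y_true.max() + 1
    # confusion matrix: rows = reference clusters, columns = predicted clusters
    c = np.zeros((k, y_pred.max() + 1))
    np.add.at(c, (y_true, y_pred), 1.0)
    # normalise each row by the reference cluster size; this per-cluster view
    # is what makes the measure asymmetric and less sensitive to size imbalance
    recall = c / c.sum(axis=1, keepdims=True)
    # optimal one-to-one matching of predicted to reference clusters
    rows, cols = linear_sum_assignment(recall, maximize=True)
    avg_recall = recall[rows, cols].sum() / k
    # chance correction (an assumption of this sketch): 1/k is taken as the
    # per-cluster recall expected of an uninformative partition, so such a
    # result maps to 0 and a perfect matching maps to 1
    return (avg_recall - 1.0 / k) / (1.0 - 1.0 / k)

# Toy usage: a perfect clustering (up to relabelling) scores 1, whereas
# lumping everything into a single cluster scores 0 after the chance correction.
ref = [0, 0, 1, 1, 2, 2]
print(normalised_set_matching_accuracy(ref, [2, 2, 0, 0, 1, 1]))  # 1.0
print(normalised_set_matching_accuracy(ref, [0, 0, 0, 0, 0, 0]))  # 0.0

The optimal matching is obtained with the Hungarian algorithm (scipy's linear_sum_assignment), and the asymmetry stems from normalising the confusion matrix by the reference cluster sizes only, so every ground-truth cluster contributes equally regardless of how large it is.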