In machine learning, the performance of a classifier depends on both the classifier model and the separability/complexity of the dataset. To quantitatively measure the separability of datasets, we propose an intrinsic measure, the Distance-based Separability Index (DSI), which is independent of the classifier model. We take the case in which data from different classes are mixed within the same distribution to be the most difficult for classifiers to separate. We then formally show that, for data of any dimensionality, the DSI can indicate whether the distributions of the datasets are identical. We verify that the DSI is an effective separability measure by comparing it with several state-of-the-art separability/complexity measures on synthetic and real datasets. Having demonstrated the DSI's ability to compare distributions of samples, we also discuss some of its other promising applications, such as measuring the performance of generative adversarial networks (GANs) and evaluating the results of clustering methods.
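The abstract does not spell out how the index is computed, so the following is only an illustrative sketch under a stated assumption: for each class, the distances between points within the class are compared with the distances from that class to all other points using the two-sample Kolmogorov-Smirnov statistic, and the per-class statistics are averaged. The function names (`ks_statistic`, `dsi_sketch`, `blob`) and the toy data are hypothetical and are not taken from the paper.

```python
import bisect
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect.bisect_right(a, v) / len(a) - bisect.bisect_right(b, v) / len(b))
        for v in a + b
    )


def dsi_sketch(points, labels):
    """ASSUMED form of a distance-based separability score: mean over classes
    of the KS statistic between intra-class and between-class distances.
    Requires at least two classes and at least two points per class."""
    scores = []
    for c in sorted(set(labels)):
        inside = [p for p, l in zip(points, labels) if l == c]
        outside = [p for p, l in zip(points, labels) if l != c]
        # Distances between pairs of points within class c.
        intra = [
            euclidean(inside[i], inside[j])
            for i in range(len(inside))
            for j in range(i + 1, len(inside))
        ]
        # Distances from points in class c to points of every other class.
        between = [euclidean(p, q) for p in inside for q in outside]
        scores.append(ks_statistic(intra, between))
    return sum(scores) / len(scores)


# Toy data (hypothetical): two 2-D Gaussian blobs, far apart vs. overlapping.
rng = random.Random(0)


def blob(center, n):
    return [(rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)) for _ in range(n)]


labels = [0] * 60 + [1] * 60
separated = dsi_sketch(blob((0, 0), 60) + blob((10, 10), 60), labels)
mixed = dsi_sketch(blob((0, 0), 60) + blob((0, 0), 60), labels)
print(f"separated: {separated:.3f}, mixed: {mixed:.3f}")
```

Under this assumed form, a score near 1 means intra-class and between-class distances are distributed very differently (well-separated classes), while a score near 0 means the two distance sets look alike, which is the mixed-distribution case the abstract describes as hardest for classifiers.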