Machine learning and deep learning classification models are data-driven: the model and the data jointly determine classification performance. Evaluating a model solely by its classifier accuracy while ignoring data separability is therefore biased, since excellent accuracy may simply reflect testing on highly separable data. Most existing data separability measures are defined in terms of distances between sample points, but such measures have been shown to fail in several circumstances. In this paper, we propose a new separability measure, the rate of separability (RS), which is based on the data coding rate. We validate its effectiveness as a supplement to existing separability measures by comparing it with four distance-based measures on synthetic datasets. We then demonstrate a positive correlation between the proposed measure and recognition accuracy in a multi-task scenario constructed from a real dataset. Finally, we discuss methods for evaluating the classification performance of machine learning and deep learning models while taking data separability into account.