Quality of Data in Machine Learning

From arXiv. Presented at the International Workshop on Data Quality for Intelligent Systems (DQIS), a co-located event of QRS 2021 (the 21st IEEE International Conference on Software Quality, Reliability, and Security).

It is commonly assumed that machine learning models perform better when they have more data to learn from. In this study, the authors set out to test that assumption with an empirical experiment on novel vocational student data. The experiment compared different machine learning algorithms while varying the amount of data and the feature combinations available for training and testing the models. It showed that increasing the number of data records or their sampling frequency does not immediately lead to significant increases in model accuracy or performance; however, the variance of the accuracies does diminish for ensemble models. A similar phenomenon was observed when the number of input features was increased. The study refutes the starting assumption and concludes that, in this case, the significance of the data lies in its quality rather than its quantity.
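The experimental protocol described in the abstract (vary the amount of training data, repeat the runs, and compare both the mean and the spread of the accuracies) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data is synthetic rather than the vocational student data, the nearest-centroid classifier stands in for the unnamed algorithms in the study, and names such as `make_dataset` and `run_experiment` are hypothetical.

```python
import random
import statistics


def make_dataset(n, n_features, rng):
    """Synthetic two-class data in which only the first feature is informative."""
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        # Informative feature: class means at -1 and +1, unit noise.
        features = [rng.gauss(2 * label - 1, 1.0)]
        # The remaining features are pure noise: more inputs, no more signal.
        features += [rng.gauss(0.0, 1.0) for _ in range(n_features - 1)]
        rows.append((features, label))
    return rows


def fit_centroids(train):
    """Nearest-centroid classifier: store one mean vector per class."""
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in train if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, test):
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)


def run_experiment(train_sizes, n_features, repeats=20, test_size=500):
    """Return {train_size: (mean accuracy, std of accuracy)} over repeated runs."""
    results = {}
    for size in train_sizes:
        scores = []
        for seed in range(repeats):
            rng = random.Random(seed)  # fixed seeds keep the runs reproducible
            train = make_dataset(size, n_features, rng)
            test = make_dataset(test_size, n_features, rng)
            scores.append(accuracy(fit_centroids(train), test))
        results[size] = (statistics.fmean(scores), statistics.stdev(scores))
    return results


if __name__ == "__main__":
    for size, (mean, std) in run_experiment([20, 200, 2000], n_features=5).items():
        print(f"n={size:5d}  mean accuracy={mean:.3f}  std={std:.3f}")
```

Reporting the standard deviation alongside the mean is the point of the sketch: it lets a reader see whether extra data changes the typical accuracy, the run-to-run variability, or both. Whatever a synthetic run shows, it says nothing about the results of the study itself.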


