It is commonly assumed that machine learning models perform better the more data they have to learn from. In this study, the authors set out to examine this assumption empirically using a novel dataset of vocational students. The experiment compared several machine learning algorithms while varying the number of data records and the feature combinations available for training and testing the models. The results show that increasing the number of records or their sampling frequency does not directly yield significant gains in model accuracy or performance; however, the variance of the accuracies does diminish for ensemble models. A similar phenomenon was observed when the number of input features was increased. The study refutes the initial assumption and concludes that, in this case, the value of the data lies in its quality rather than its quantity.
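The experimental protocol described above can be sketched as follows. This is a minimal illustration only, assuming scikit-learn's RandomForestClassifier as a stand-in ensemble model and synthetic data in place of the authors' vocational-student dataset; it shows how one might measure both the mean and the spread of accuracies while varying the number of training records.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the study's dataset (the real data is not public here).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def accuracy_spread(n_records, n_repeats=5):
    """Train on n_records samples over several random splits;
    return the mean and standard deviation of test accuracy."""
    accs = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=n_records, test_size=500, random_state=seed)
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, model.predict(X_te)))
    return float(np.mean(accs)), float(np.std(accs))

# Vary the number of available training records, as in the experiment.
for n in (100, 400, 1200):
    mean_acc, std_acc = accuracy_spread(n)
    print(f"n={n}: mean accuracy {mean_acc:.3f}, std {std_acc:.3f}")
```

Comparing the standard deviations across training-set sizes is one way to observe the effect the abstract reports: the spread of accuracies tends to shrink with more data even when the mean accuracy barely moves.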