Finding the optimal model complexity that minimizes the generalization error (GE) is a key issue in machine learning. For conventional supervised learning, this task typically involves the bias-variance tradeoff: lowering the bias by making the model more complex entails an increase in the variance. Meanwhile, little is known about whether a similar tradeoff exists for unsupervised learning. In this study, we propose that unsupervised learning generally exhibits a two-component tradeoff of the GE, namely between the model error and the data error: using a more complex model reduces the model error at the cost of increasing the data error, with the data error playing a more significant role for a smaller training dataset. This is corroborated by training the restricted Boltzmann machine to generate the configurations of the two-dimensional Ising model at a given temperature and of the totally asymmetric simple exclusion process with given entry and exit rates. Our results also indicate that the optimal model tends to be more complex when the data to be learned are more complex.
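For concreteness, the sketch below illustrates the kind of training setup the abstract refers to: a restricted Boltzmann machine fit with one-step contrastive divergence (CD-1) on binarized spin configurations. This is not the authors' actual implementation; the lattice size, number of hidden units (the knob controlling model complexity), learning rate, and the placeholder random data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 0/1 spin configurations of a small 2D Ising lattice,
# flattened to vectors. Random placeholders stand in for samples that
# would be drawn (e.g., by Monte Carlo) at a fixed temperature.
n_visible, n_hidden = 8 * 8, 16   # n_hidden sets the model complexity
data = rng.integers(0, 2, size=(1000, n_visible)).astype(float)

# RBM parameters: weights plus visible and hidden biases
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, epochs, batch = 0.05, 50, 100
for _ in range(epochs):
    for i in range(0, len(data), batch):
        v0 = data[i:i + batch]
        # Positive phase: hidden activations given the data
        ph0 = sigmoid(v0 @ W + c)
        # CD-1: a single Gibbs step (sample hidden, reconstruct visible)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + c)
        # Approximate log-likelihood gradient: data term minus model term
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
```

Sweeping `n_hidden` while holding the training set fixed would be one way to trace the proposed tradeoff: a larger hidden layer should shrink the model error while, for small datasets, inflating the data error.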