When mining large datasets in order to predict new data, limitations of the principles behind statistical machine learning pose a serious challenge not only to the Big Data deluge, but also to the traditional assumption that data-generating processes are biased toward low algorithmic complexity. Even assuming an underlying algorithmic-informational bias toward simplicity in finite dataset generators, we show that fully automated computable learning algorithms, with or without access to pseudo-random generators, and in particular those of a statistical nature used in current approaches to machine learning (including deep learning), can always be deceived, naturally or artificially, by sufficiently large datasets. In particular, we demonstrate that, for every finite learning algorithm, there is a sufficiently large dataset size above which the algorithmic probability of an unpredictable deceiver is an upper bound (up to a multiplicative constant that depends only on the learning algorithm) for the algorithmic probability of any other larger dataset. In other words, very large and complex datasets are as likely to deceive learning algorithms into a "simplicity bubble" as any other particular dataset. These deceiving datasets guarantee that any prediction will diverge from the high-algorithmic-complexity globally optimal solution while converging toward the low-algorithmic-complexity locally optimal solution. We discuss the framework and empirical conditions for circumventing this deceptive phenomenon, moving away from statistical machine learning toward a stronger type of machine learning based on, or motivated by, the intrinsic power of algorithmic information theory and computability theory.
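In symbols, the central bound can be sketched as follows (the notation here is our shorthand, not fixed by the text: $\mathbf{m}$ denotes the universal algorithmic probability, $\mathbf{A}$ the learning algorithm, $z$ the unpredictable deceiving dataset, and $n_0$ the critical dataset size): for every dataset $y$ with $|y| \geq n_0$,

\[
\mathbf{m}(y) \;\leq\; C_{\mathbf{A}} \, \mathbf{m}(z),
\]

where $C_{\mathbf{A}} > 0$ is a multiplicative constant depending only on $\mathbf{A}$. Since, by the coding theorem, $\mathbf{m}(x)$ coincides up to a multiplicative constant with $2^{-K(x)}$ for prefix algorithmic complexity $K$, the bound says that no sufficiently large dataset is significantly more algorithmically probable than the deceiver, which is the formal sense in which very large and complex datasets are "as likely" to be deceivers as any other particular dataset.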