When mining large datasets in order to predict new data, limitations of the principles behind statistical machine learning pose a serious challenge not only to the Big Data deluge, but also to the traditional assumption that data-generating processes are biased toward low algorithmic complexity. Even assuming an underlying algorithmic-informational bias toward simplicity in finite dataset generators, we show that fully automated computable learning algorithms, with or without access to pseudo-random generators, and in particular those of a statistical nature used in current approaches to machine learning (including deep learning), can always be deceived, naturally or artificially, by sufficiently large datasets. In particular, we demonstrate that, for every finite learning algorithm, there is a sufficiently large dataset size above which the algorithmic probability of an unpredictable deceiver is an upper bound (up to a multiplicative constant that depends only on the learning algorithm) for the algorithmic probability of any other larger dataset. In other words, very large and complex datasets are as likely to deceive learning algorithms into a "simplicity bubble" as any other particular dataset. These deceiving datasets guarantee that any prediction will diverge from the high-algorithmic-complexity globally optimal solution while converging toward the low-algorithmic-complexity locally optimal solution. We discuss the framework and empirical conditions for circumventing this deceptive phenomenon, moving away from statistical machine learning toward a stronger type of machine learning based on, or motivated by, the intrinsic power of algorithmic information theory and computability theory.
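In symbols, the central bound can be sketched as follows (the notation here is our shorthand, not fixed by the text: $\mathbf{m}$ denotes the universal algorithmic probability, $\mathbf{A}$ the learning algorithm, $z$ the unpredictable deceiving dataset, and $n_0$ the critical dataset size): for every dataset $y$ with $|y| \geq n_0$,

\[
\mathbf{m}(y) \;\leq\; C_{\mathbf{A}} \, \mathbf{m}(z),
\]

where $C_{\mathbf{A}} > 0$ is a multiplicative constant depending only on $\mathbf{A}$. Since, by the coding theorem, $\mathbf{m}(x)$ coincides up to a multiplicative constant with $2^{-K(x)}$ for prefix algorithmic complexity $K$, the bound says that no sufficiently large dataset is significantly more algorithmically probable than the deceiver, which is the formal sense in which very large and complex datasets are "as likely" to be deceivers as any other particular dataset.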