Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning. Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of image- and text-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds.