Learning is a distinctive feature of intelligent behaviour. High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain or our societies. Yet, the puzzling success of Artificial Intelligence and Machine Learning shows that we still have a poor conceptual understanding of learning. These applications push statistical inference into uncharted territories where data is high-dimensional and scarce, and prior information on "true" models is scant if not totally absent. Here we review recent progress on understanding learning, based on the notion of "relevance". The relevance, as we define it here, quantifies the amount of information that a dataset or the internal representation of a learning machine contains about the generative model of the data. This allows us to define maximally informative samples, on the one hand, and optimal learning machines on the other. These are ideal limits of samples and machines that contain the maximal amount of information about the unknown generative process, at a given resolution (or level of compression). Both ideal limits exhibit critical features in the statistical sense: maximally informative samples are characterised by a power-law frequency distribution (statistical criticality) and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (i.e. compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. These are separated by a special point characterised by Zipf's law statistics. This identifies samples obeying Zipf's law as the most compressed lossless representations that are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests in an exponential degeneracy of energy levels, which leads to unusual thermodynamic properties.
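As a concrete illustration of the resolution–relevance trade-off mentioned above, the following sketch computes the two quantities for a sample, using the definitions standard in the relevance literature (an assumption here, since the abstract does not spell them out): the resolution is the entropy of the empirical distribution over observed states, and the relevance is the entropy of the distribution over frequency classes.

```python
from collections import Counter
from math import log

def resolution_and_relevance(sample):
    """Empirical resolution H[s] and relevance H[k] of a sample (in nats).

    Assumed definitions, following the relevance literature:
      H[s] = -sum_s (k_s / M) log(k_s / M), with k_s the count of state s,
      H[k] = -sum_k (k m_k / M) log(k m_k / M), with m_k the number of
             states observed exactly k times and M the sample size.
    """
    M = len(sample)
    counts = Counter(sample)                 # k_s: how often each state occurs
    freq_of_freq = Counter(counts.values())  # m_k: number of states seen k times

    H_s = -sum((k / M) * log(k / M) for k in counts.values())
    H_k = -sum((k * m / M) * log(k * m / M) for k, m in freq_of_freq.items())
    return H_s, H_k
```

For example, a sample in which every state is seen exactly once has maximal resolution but zero relevance (the frequencies carry no information about the generative model), whereas a sample with a broad, Zipf-like spread of frequencies scores high on relevance at a given resolution.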