Recently proposed methods in data subset selection, that is, active learning and active sampling, use Fisher information, Hessians, similarity matrices based on gradients, and gradient lengths to estimate how informative data is for a model's training. Are these different approaches connected, and if so, how? We revisit the fundamentals of Bayesian optimal experimental design and show that these recently proposed methods can be understood as approximations to information-theoretic quantities: among them, the mutual information between predictions and model parameters, known as expected information gain or BALD in machine learning, and the mutual information between predictions of acquisition candidates and test samples, known as expected predictive information gain. We develop a comprehensive set of approximations using Fisher information and observed information and derive a unified framework that connects this seemingly disparate literature. Although Bayesian methods are often seen as separate from non-Bayesian ones, the sometimes fuzzy notion of "informativeness" expressed in various non-Bayesian objectives leads to the same pair of information quantities, which were, in principle, already known to Lindley (1956) and MacKay (1992).
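For concreteness, the two quantities named above can be written as mutual informations. The following is a sketch in standard notation, where $x^{\mathrm{acq}}$, $x^{\mathrm{eval}}$, $\mathcal{D}_{\mathrm{train}}$, and $\Theta$ are our labels for an acquisition candidate, an evaluation (test) input, the training data, and the model parameters; they may differ from the notation used in the body of the paper:

\begin{align}
\mathrm{EIG}(x^{\mathrm{acq}}) &= \operatorname{I}[\Theta ; Y^{\mathrm{acq}} \mid x^{\mathrm{acq}}, \mathcal{D}_{\mathrm{train}}] \notag \\
&= \operatorname{H}[Y^{\mathrm{acq}} \mid x^{\mathrm{acq}}, \mathcal{D}_{\mathrm{train}}]
   - \mathbb{E}_{p(\theta \mid \mathcal{D}_{\mathrm{train}})}
     \operatorname{H}[Y^{\mathrm{acq}} \mid x^{\mathrm{acq}}, \theta]
   && \text{(expected information gain / BALD)} \\
\mathrm{EPIG}(x^{\mathrm{acq}}) &= \mathbb{E}_{p(x^{\mathrm{eval}})}
   \operatorname{I}[Y^{\mathrm{acq}} ; Y^{\mathrm{eval}} \mid x^{\mathrm{acq}}, x^{\mathrm{eval}}, \mathcal{D}_{\mathrm{train}}]
   && \text{(expected predictive information gain)}
\end{align}

The first quantity scores a candidate by how much its label would reduce uncertainty about the parameters; the second scores it by how much its label would reduce uncertainty about predictions on evaluation inputs.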