Acquiring labeled data is challenging in many machine learning applications with limited labeling budgets. Active learning provides a procedure for selecting the most informative data points, improving data efficiency by reducing labeling cost. The information-maximization learning principle of maximizing mutual information, exemplified by BALD, has been successful and widely adopted in various active learning applications. However, this pool-based objective inherently introduces redundant selections and further incurs a high computational cost for batch selection. In this paper, we design and propose a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the uncertainty of the underlying softmax probability and the label variable. To do this, we approximate each marginal distribution by a Beta distribution. The Beta approximation enables us to formulate BalEntAcq as a ratio between an augmented entropy and the marginalized joint entropy. The closed-form expression of BalEntAcq facilitates parallelization, since only the two parameters of each marginal Beta distribution need to be estimated. BalEntAcq is a purely standalone measure that requires no relational computation with other data points. Nevertheless, BalEntAcq yields a well-diversified selection near the decision boundary with a margin, unlike other existing uncertainty measures such as BALD, Entropy, or Mean Standard Deviation (MeanSD). Finally, we demonstrate that our balanced entropy learning principle with BalEntAcq consistently outperforms well-known linearly scalable active learning methods, including the recently proposed PowerBALD, a simple but diversified version of BALD, on the MNIST, CIFAR-100, SVHN, and TinyImageNet datasets.
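The Beta approximation mentioned above amounts to fitting the two parameters of a Beta distribution to each marginal softmax probability, e.g. from stochastic forward passes. The following is a minimal, illustrative sketch of that per-marginal fitting step using the standard method of moments; the function name `fit_beta_moments` and the synthetic samples are our own assumptions, not artifacts of the paper, and the full BalEntAcq expression built on top of these parameters is given in the paper itself.

```python
import numpy as np

def fit_beta_moments(samples):
    """Fit Beta(alpha, beta) to samples in (0, 1) by matching
    the first two moments (method of moments)."""
    m = samples.mean()
    v = samples.var(ddof=1)
    # From the Beta relations mean = a/(a+b) and
    # var = a*b / ((a+b)^2 * (a+b+1)), both parameters share
    # the common factor below.
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Toy stand-in for MC-dropout softmax outputs: T stochastic forward
# passes would give T samples of one class's marginal probability.
rng = np.random.default_rng(0)
p_samples = rng.beta(2.0, 5.0, size=1000)  # hypothetical marginal draws
a_hat, b_hat = fit_beta_moments(p_samples)
print(a_hat, b_hat)
```

Because the fit matches the first moment exactly, the fitted Beta mean `a_hat / (a_hat + b_hat)` equals the sample mean, and the recovered parameters land near the generating values for a large enough sample.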