In some real-world applications, due to privacy protection or the difficulty of data collection, we cannot observe individual outputs for each instance, but we can observe aggregated outputs that are summed over multiple instances in a set. To reduce the labeling cost of training regression models for such aggregated data, we propose an active learning method that sequentially selects sets to be labeled, improving predictive performance with fewer labeled sets. As the selection criterion, the proposed method uses the mutual information, which quantifies the reduction in the uncertainty of the model parameters obtained by observing the aggregated output. By modeling the output given an input with Bayesian linear basis function models, which include approximate Gaussian processes and neural networks, we can compute the mutual information efficiently in closed form. Through experiments on various datasets, we demonstrate that the proposed method achieves better predictive performance with fewer labeled sets than existing methods.
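The flavor of the closed-form criterion can be sketched as follows. Assuming a Bayesian linear basis-function model with weight posterior covariance `Sigma` and i.i.d. Gaussian observation noise of variance `noise_var`, the aggregated (summed) output of a candidate set is itself Gaussian, and its mutual information with the weights reduces to a simple variance ratio. The function names, the `featurize` helper, and the greedy selection loop below are illustrative assumptions, not code from the paper.

```python
import numpy as np

def mutual_information(phi_set, Sigma, noise_var):
    """MI between the model weights and the aggregated (summed) output of one set.

    phi_set   : (n, d) basis-function features of the n instances in the set.
    Sigma     : (d, d) posterior covariance of the weights.
    noise_var : per-instance Gaussian observation noise variance.
    """
    n = phi_set.shape[0]
    phi_sum = phi_set.sum(axis=0)          # features of the aggregated output
    signal = phi_sum @ Sigma @ phi_sum     # predictive variance due to weight uncertainty
    noise = n * noise_var                  # variance of the summed observation noise
    # 0.5 * log((signal + noise) / noise): uncertainty reduction from observing the sum
    return 0.5 * np.log1p(signal / noise)

def select_next_set(candidate_sets, Sigma, noise_var, featurize):
    """Greedily pick the unlabeled set whose aggregated output is most informative."""
    scores = [mutual_information(featurize(S), Sigma, noise_var) for S in candidate_sets]
    return int(np.argmax(scores))
```

In this sketch, after the chosen set is labeled with its aggregated output, the weight posterior (and hence `Sigma`) would be updated before the next selection round.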