Contrastive Identification of Covariate Shift in Image Data

Identifying covariate shift is crucial for making machine learning systems robust in the real world and for detecting training data biases that are not reflected in test data. However, detecting covariate shift is challenging, especially when the data consists of high-dimensional images, and when multiple types of localized covariate shift affect different subspaces of the data. Although automated techniques can be used to detect the existence of covariate shift, our goal is to help human users characterize the extent of covariate shift in large image datasets with interfaces that seamlessly integrate information obtained from the detection algorithms. In this paper, we design and evaluate a new visual interface that facilitates the comparison of the local distributions of training and test data. We conduct a quantitative user study on multi-attribute facial data to compare two different learned low-dimensional latent representations (pretrained ImageNet CNN vs. density ratio) and two user analytic workflows (nearest-neighbor vs. cluster-to-cluster). Our results indicate that the latent representation of our density ratio model, combined with a nearest-neighbor comparison, is the most effective at helping humans identify covariate shift.

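The nearest-neighbor comparison of local training and test distributions mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's interface or its density ratio model: the function name `local_test_fraction`, the synthetic 2-D latent vectors, and the choice of k = 10 are all assumptions made for illustration. The idea is that, in a pooled latent space, a neighborhood whose share of test points is far from the global share suggests localized covariate shift.

```python
import numpy as np

def local_test_fraction(train_z, test_z, k=10):
    """Hypothetical helper: for every pooled latent vector, return the
    fraction of its k nearest neighbors (excluding itself) that come
    from the test set."""
    pooled = np.vstack([train_z, test_z])
    is_test = np.concatenate([np.zeros(len(train_z)), np.ones(len(test_z))])
    # Pairwise squared Euclidean distances (adequate for small examples).
    sq = np.sum(pooled ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * pooled @ pooled.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Indices of the k smallest distances in each row (unordered).
    nn_idx = np.argpartition(d2, k, axis=1)[:, :k]
    return is_test[nn_idx].mean(axis=1)

# Synthetic latent vectors (assumption): the test set mostly matches the
# training distribution but contains one small cluster in a region that
# the training data does not cover.
rng = np.random.default_rng(0)
train_z = rng.normal(0.0, 1.0, size=(300, 2))
test_z = np.vstack([
    rng.normal(0.0, 1.0, size=(250, 2)),   # shared region
    rng.normal(6.0, 0.3, size=(50, 2)),    # localized shift
])

frac = local_test_fraction(train_z, test_z, k=10)
base_rate = len(test_z) / (len(train_z) + len(test_z))
shared_region = frac[:550]    # 300 train + 250 matching test points
shifted_region = frac[-50:]   # the 50 shifted test points

print("global test share:", round(base_rate, 3))
print("mean local test share, shared region:", round(shared_region.mean(), 3))
print("mean local test share, shifted region:", round(shifted_region.mean(), 3))
```

In this toy setting the shifted cluster's neighborhoods consist almost entirely of test points, while neighborhoods in the shared region stay close to the balance of the two sets there. A real analysis would replace the synthetic vectors with a learned latent representation such as the ones compared in the study.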