Deep Neural Networks (DNNs) often rely on very large datasets for training. Given the large size of such datasets, it is conceivable that they contain certain samples that either do not contribute to, or even negatively impact, the DNN's optimization. Modifying the training distribution to exclude such samples could provide an effective solution to both improve performance and reduce training time. In this paper, we propose to scale up ensemble Active Learning (AL) methods to perform acquisition at a large scale (10k to 500k samples at a time). We do this with ensembles of hundreds of models, obtained at a minimal computational cost by reusing intermediate training checkpoints. This allows us to automatically and efficiently perform a training data subset search for large labeled datasets. We observe that our approach obtains favorable subsets of training data, which can be used to train more accurate DNNs than training with the entire dataset. We perform an extensive experimental study of this phenomenon on three image classification benchmarks (CIFAR-10, CIFAR-100 and ImageNet), as well as an internal object detection benchmark for prototyping perception models for autonomous driving. Unlike existing studies, our experiments on object detection are at the scale required for production-ready autonomous driving systems. We provide insights on the impact of different initialization schemes, acquisition functions and ensemble configurations at this scale. Our results provide strong empirical evidence that optimizing the training data distribution can provide significant benefits on large scale vision tasks.
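The core idea of checkpoint-based ensemble acquisition can be illustrated with a minimal sketch. The function names and the choice of predictive entropy as the acquisition function are illustrative assumptions, not the paper's exact method: predictions from several intermediate training checkpoints are averaged, each sample is scored by the entropy of the averaged softmax output, and the top-k highest-scoring samples are selected for acquisition.

```python
import numpy as np

def ensemble_entropy_scores(checkpoint_probs):
    """Score samples by predictive entropy of the checkpoint ensemble.

    checkpoint_probs: array of shape (n_checkpoints, n_samples, n_classes)
    holding each checkpoint's softmax outputs. Returns one entropy score
    per sample; higher means the ensemble is more uncertain.
    """
    # Average the class probabilities across checkpoints (the "ensemble").
    mean_probs = checkpoint_probs.mean(axis=0)
    # Entropy of the averaged distribution; epsilon avoids log(0).
    return -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)

def acquire_top_k(checkpoint_probs, k):
    """Return the indices of the k most uncertain samples."""
    scores = ensemble_entropy_scores(checkpoint_probs)
    return np.argsort(-scores)[:k]
```

In practice the checkpoint predictions would come from snapshots saved during a single training run, so the ensemble costs little beyond the original training; here they are simply passed in as an array.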