With the rapid growth of data, it is becoming increasingly difficult to train or improve deep learning models with the right subset of data. We show that this problem can be solved effectively, at an additional labeling cost, by targeted data subset selection (TSS), where a subset of unlabeled data points similar to an auxiliary set is added to the training data. We do so using a rich class of Submodular Mutual Information (SMI) functions and demonstrate their effectiveness for image classification on the CIFAR-10 and MNIST datasets. Lastly, we compare the performance of SMI functions for TSS with other state-of-the-art methods for closely related problems such as active learning. Using SMI functions, we observe a ~20-30% gain over the model's performance before re-training with the added targeted subset, ~12% more than other methods.
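To make the TSS idea concrete, here is a minimal sketch of selecting unlabeled points by maximizing an SMI function between the selected subset and a target (auxiliary) set. It assumes cosine-similarity features and the graph-cut instantiation of SMI, for which each point's marginal gain is modular, so greedy maximization reduces to top-k scoring; the function names and feature setup are illustrative, not the authors' implementation.

```python
import numpy as np

def greedy_tss(features_unlabeled, features_target, k):
    """Pick k unlabeled points maximizing a graph-cut-style SMI
    with the target set: GCMI(A; Q) = 2 * sum_{a in A, q in Q} sim(a, q).

    Because each point's contribution is independent of the rest of A
    (the objective is modular in A), greedy selection is exact top-k.
    """
    # L2-normalize so dot products are cosine similarities
    U = features_unlabeled / np.linalg.norm(features_unlabeled, axis=1, keepdims=True)
    T = features_target / np.linalg.norm(features_target, axis=1, keepdims=True)
    sim = U @ T.T                  # (n_unlabeled, n_target) similarity matrix
    gains = sim.sum(axis=1)        # marginal SMI gain of each unlabeled point
    return np.argsort(-gains)[:k]  # indices of the k highest-gain points
```

Richer SMI variants (e.g., facility-location mutual information) are not modular, so selection there requires a true greedy loop that recomputes marginal gains after each pick; the structure above is the simplest member of the family.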