It is commonly believed that, in transfer learning, including more pre-training data translates into better performance. However, recent evidence suggests that removing data from the source dataset can actually help as well. In this work, we take a closer look at the role of the source dataset's composition in transfer learning and present a framework for probing its impact on downstream performance. Our framework gives rise to new capabilities, such as pinpointing transfer learning brittleness and detecting pathologies such as data leakage and the presence of misleading examples in the source dataset. In particular, we demonstrate that removing detrimental datapoints identified by our framework improves transfer learning performance from ImageNet on a variety of target tasks. Code is available at https://github.com/MadryLab/data-transfer.