Models that can actively seek out the best-quality training data hold the promise of more accurate, adaptable, and efficient machine learning. State-of-the-art active learning techniques tend to prefer examples that are the most difficult to classify. While this works well on homogeneous datasets, we find that it can lead to catastrophic failures when applied to multiple distributions with different degrees of label noise or heteroskedasticity. These active learning algorithms strongly prefer to draw from the noisier distribution, even when its examples have no informative structure (such as solid-color images with random labels). We demonstrate this catastrophic failure of active learning algorithms on heteroskedastic distributions and propose a fine-tuning-based approach to mitigate it. Further, we propose a new algorithm that incorporates a model-difference scoring function for each data point to filter out noisy examples and sample clean examples that maximize accuracy, outperforming existing active learning techniques on heteroskedastic datasets. We hope these observations and techniques are immediately helpful to practitioners and can help to challenge common assumptions in the design of active learning algorithms.
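As a loose illustration of the "model-difference scoring function" mentioned above, one could score each unlabeled example by how much two models disagree on it and filter out high-disagreement (likely noisy) examples. The abstract does not specify the models or the distance metric; the two-checkpoint setup and L2 distance below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def model_difference_scores(probs_a: np.ndarray, probs_b: np.ndarray) -> np.ndarray:
    """Per-example disagreement between two models' predicted class probabilities.

    Illustrative assumption: examples on which the two models disagree strongly
    (high score) are treated as likely noisy and filtered out; low-score
    examples are kept as candidates for labeling.
    """
    return np.linalg.norm(probs_a - probs_b, axis=1)

# Toy example: class-probability predictions from two model snapshots on 3 examples.
probs_a = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
probs_b = np.array([[0.6, 0.4], [0.5, 0.5], [0.25, 0.75]])

scores = model_difference_scores(probs_a, probs_b)
clean_candidates = np.argsort(scores)[:2]  # keep the 2 most-agreed-upon examples
```

In a full active learning loop, such a filter would be combined with a standard acquisition criterion (e.g., uncertainty) applied only to the retained clean candidates.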