Recommendation | Biased data? The model can still learn the right thing!
Learning from others' mistakes: Avoiding dataset biases without modeling them
Published: 2020/12/02
Editor's note / 夕小瑶的卖萌屋: Machine learning, as an attempt to model human reasoning, still falls short of humans at extracting models from data, but when it comes to picking up biases, the student has long since surpassed the master. On the mechanisms behind model bias, Google devoted 59 pages to a detailed analysis spanning natural language, image processing, and biomedical applications. Its conclusion: no matter how large a dataset is, sampling bias is unavoidable, so a model will always pick up some spurious features, and scaling up the dataset is not the ultimate fix for model bias. Unfortunately, Google offered no inventive remedy in that report, only the advice to test more. The paper recommended here goes a step further: even when the bias cannot be explicitly identified, it shows how to train a model that learns to ignore such problematic correlations.
Authors: Victor Sanh, Thomas Wolf, Yonatan Belinkov, Alexander M. Rush
Link: https://arxiv.org/abs/2012.01300
PDF: https://arxiv.org/pdf/2012.01300.pdf
Abstract: State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended underlying task. Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available. We consider cases where the bias issues may not be explicitly identified, and show a method for training models that learn to ignore these problematic correlations. Our approach relies on the observation that models with limited capacity primarily learn to exploit biases in the dataset. We can leverage the errors of such limited capacity models to train a more robust model in a product of experts, thus bypassing the need to hand-craft a biased model. We show the effectiveness of this method to retain improvements in out-of-distribution settings even if no particular bias is targeted by the biased model.
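To make the product-of-experts idea in the abstract concrete, here is a minimal PyTorch sketch of the training loss: the main model's log-probabilities are added to those of a frozen, limited-capacity "weak" model, and cross-entropy is taken over the renormalized product, so examples the weak model already solves via a bias contribute little gradient. The function name `poe_loss`, the toy tensors, and the choice to freeze the weak model via `.detach()` are illustrative assumptions for this sketch, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def poe_loss(main_logits: torch.Tensor,
             weak_logits: torch.Tensor,
             labels: torch.Tensor) -> torch.Tensor:
    """Product-of-experts loss: train the main model through the combined
    prediction p_weak * p_main, down-weighting examples the weak (biased)
    model already gets right."""
    log_p_main = F.log_softmax(main_logits, dim=-1)
    # The weak model is treated as fixed: detach so no gradient reaches it.
    log_p_weak = F.log_softmax(weak_logits, dim=-1).detach()
    # softmax(log_p_weak + log_p_main) renormalizes the product of the two
    # distributions, and F.cross_entropy applies that log-softmax internally.
    return F.cross_entropy(log_p_main + log_p_weak, labels)

# Toy usage: a batch of 4 examples with 3 classes (e.g., NLI labels).
main_logits = torch.randn(4, 3, requires_grad=True)
weak_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 1])
loss = poe_loss(main_logits, weak_logits, labels)
loss.backward()  # gradients flow only into the main model's logits
```

Intuitively, when the weak model is confidently correct on an example, the product distribution already places mass on the right label, so the main model receives little pressure to exploit that (likely biased) shortcut; the remaining gradient pushes it toward examples the weak model gets wrong.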
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)