Pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode artifacts undesired in many applications, such as professions correlating with one gender more than another. We explore such gendered correlations as a case study for how to address unintended correlations in pre-trained models. We define metrics and reveal that it is possible for models with similar accuracy to encode correlations at very different rates. We show how measured correlations can be reduced with general-purpose techniques, and highlight the trade-offs of different strategies. With these results, we make recommendations for training robust models: (1) carefully evaluate unintended correlations, (2) be mindful of seemingly innocuous configuration differences, and (3) focus on general mitigations.
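To make the idea of measuring a gendered correlation concrete, here is a minimal sketch (not the paper's own metric) that probes a masked language model for how strongly a profession word pulls the prediction toward "he" versus "she" in a pronoun slot. The model name, templates, and profession list are illustrative assumptions, and the sketch relies on the HuggingFace `fill-mask` pipeline.

```python
# Hypothetical probe for gendered correlations in a masked LM.
# Assumes the `transformers` library is installed; model and templates
# are illustrative, not taken from the paper.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

professions = ["nurse", "engineer", "receptionist", "carpenter"]
template = "The {} said that [MASK] would arrive soon."

for profession in professions:
    # Restrict predictions to the two pronouns and read off their scores.
    preds = fill_mask(template.format(profession), targets=["he", "she"])
    scores = {p["token_str"]: p["score"] for p in preds}
    gap = scores.get("he", 0.0) - scores.get("she", 0.0)
    print(f"{profession:>14}: P(he)={scores.get('he', 0.0):.3f} "
          f"P(she)={scores.get('she', 0.0):.3f} gap={gap:+.3f}")
```

A large gap for some professions but not others, in two models with similar downstream accuracy, would illustrate the paper's point that accuracy alone does not reveal how strongly a model encodes such correlations.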