The NLP community is currently investing far more research and resources into the development of deep learning models than into training data. While we have made a lot of progress, it is now clear that our models learn all kinds of spurious patterns, social biases, and annotation artifacts. Algorithmic solutions have so far had limited success. An alternative that is being actively discussed is more careful design of datasets so that they deliver specific signals. This position paper maps out the arguments for and against data curation, and argues that fundamentally the point is moot: curation is already happening and will continue to happen, and it is changing the world. The only question is how much thought we want to invest in that process.