Much of the progress in contemporary NLP has come from learning representations, such as masked language model (MLM) contextual embeddings, that turn challenging problems into simple classification tasks. But how do we quantify and explain this effect? We adapt general tools from computational learning theory to fit the specific characteristics of text datasets and present a method to evaluate the compatibility between representations and tasks. Even though many tasks can be easily solved with simple bag-of-words (BOW) representations, BOW does poorly on hard natural language inference tasks. For one such task we find that BOW cannot distinguish between real and randomized labelings, while pre-trained MLM representations show 72x greater distinction between real and random labelings than BOW. This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters. The method provides a fresh perspective on the patterns in a dataset and the alignment of those patterns with specific labels.
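For intuition only, the sketch below shows one way to operationalize the real-versus-random comparison described above: fit a simple linear probe on a fixed representation using the true labels and again using shuffled labels, and measure the gap. The feature matrices, variable names, and use of cross-validated accuracy are illustrative assumptions, not the calibrated measure developed in the paper.

```python
# Minimal sketch (assumptions, not the paper's exact procedure): compare how well a
# linear probe separates real labels from randomly shuffled labels under a given
# representation. `bow_features`, `mlm_features`, and `labels` are assumed to be
# provided by the reader (e.g., document-term counts and pooled MLM embeddings).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def real_vs_random_gap(features, labels, seed=0):
    """Gap in cross-validated accuracy between the real labeling and a shuffled one."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(labels)          # destroy the pattern-label alignment
    clf = LogisticRegression(max_iter=1000)
    real_acc = cross_val_score(clf, features, labels, cv=5).mean()
    random_acc = cross_val_score(clf, features, shuffled, cv=5).mean()
    return real_acc - random_acc

# A representation is more compatible with the task when the gap is larger:
# it captures structure in the true labeling that vanishes under randomization.
# gap_bow = real_vs_random_gap(bow_features, labels)
# gap_mlm = real_vs_random_gap(mlm_features, labels)
# print(f"BOW gap: {gap_bow:.3f}, MLM gap: {gap_mlm:.3f}")
```

Under this toy setup, a BOW representation on a hard inference task would show a gap near zero (it cannot tell real from random labelings), while a pre-trained MLM representation would show a much larger gap.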