Since BERT (Devlin et al., 2018), learning contextualized word embeddings has been a de facto standard in NLP. However, progress in learning contextualized phrase embeddings is hindered by the lack of a human-annotated, phrase-in-context benchmark. To fill this gap, we propose PiC, a dataset of ~28K noun phrases accompanied by their contextual Wikipedia pages, together with a suite of three tasks of increasing difficulty for evaluating the quality of phrase embeddings. We find that training on our dataset improves ranking models' accuracy and remarkably pushes Question Answering (QA) models to near-human accuracy, which is 95% Exact Match (EM), on semantic search given a query phrase and a passage. Interestingly, we find evidence that this impressive performance arises because the QA models learn to better capture the common meaning of a phrase regardless of its actual context. That is, on our Phrase Sense Disambiguation (PSD) task, state-of-the-art (SotA) model accuracy drops substantially (60% EM), as the models fail to differentiate between two different senses of the same phrase under two different contexts. Further results on our 3-task PiC benchmark reveal that learning contextualized phrase embeddings remains an interesting, open challenge.