Several NLP tasks require effective representations of text documents. Arora et al. (2017) demonstrate that a simple weighted average of word vectors frequently outperforms neural models. SCDV (Mekala et al., 2017) extends this from sentences to documents by employing soft, sparse clustering over pre-computed word vectors. However, both techniques ignore the polysemy and contextual nature of words. In this paper, we address this issue by proposing SCDV+BERT(ctxd), a simple and effective unsupervised representation that combines contextualized BERT-based word embeddings (Devlin et al., 2019) for word sense disambiguation with the SCDV soft clustering approach. We show that our embeddings outperform the original SCDV, pre-trained BERT, and several other baselines on many classification datasets. We also demonstrate the effectiveness of our embeddings on other tasks, such as concept matching and sentence similarity. In addition, we show that SCDV+BERT(ctxd) outperforms fine-tuned BERT and other embedding approaches in scenarios with limited data and only a few labeled examples.
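To make the described pipeline concrete, the sketch below approximates an SCDV-style document embedding built from contextual BERT token vectors: token embeddings are soft-clustered with a Gaussian mixture, cluster posteriors weight each token vector, and the concatenated, averaged result is sparsified. The model choice, clustering parameters, and omission of IDF weighting are illustrative assumptions, not the authors' exact SCDV+BERT(ctxd) procedure.

    # Minimal sketch of an SCDV-style document vector built from contextual BERT
    # token embeddings. Illustrative approximation only: model name, clustering
    # setup, and the missing IDF weighting are assumptions, not the paper's code.
    import numpy as np
    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.mixture import GaussianMixture

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def contextual_token_vectors(text):
        """Return one contextual embedding per word-piece token (last hidden layer)."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        return hidden.numpy()

    def scdv_document_vector(doc_token_vecs, gmm, sparsity=0.04):
        """SCDV-style vector: weight each token vector by its cluster posteriors,
        concatenate across clusters, average over tokens, then sparsify."""
        probs = gmm.predict_proba(doc_token_vecs)           # (n_tokens, n_clusters)
        per_token = np.einsum("tk,td->tkd", probs, doc_token_vecs)
        doc_vec = per_token.reshape(len(doc_token_vecs), -1).mean(axis=0)
        threshold = sparsity * np.abs(doc_vec).max()
        doc_vec[np.abs(doc_vec) < threshold] = 0.0          # hard-threshold, as in SCDV
        return doc_vec

    # Usage: fit the soft clustering on token vectors pooled from the corpus,
    # then embed each document.
    corpus = ["the bank raised interest rates", "the river bank was flooded"]
    all_vecs = np.vstack([contextual_token_vectors(d) for d in corpus])
    gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(all_vecs)
    doc_embeddings = [scdv_document_vector(contextual_token_vectors(d), gmm) for d in corpus]

Because the token vectors are contextual, the two occurrences of "bank" above receive different embeddings and can fall into different clusters, which is the word-sense effect the proposed representation exploits.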