Extending Multi-Sense Word Embedding to Phrases and Sentences for Unsupervised Semantic Applications

Most unsupervised NLP models represent each word with a single point or single region in semantic space, while the existing multi-sense word embeddings cannot represent longer word sequences like phrases or sentences. We propose a novel embedding method for a text sequence (a phrase or a sentence) where each sequence is represented by a distinct set of multi-mode codebook embeddings to capture different semantic facets of its meaning. The codebook embeddings can be viewed as the cluster centers which summarize the distribution of possibly co-occurring words in a pre-trained word embedding space. We introduce an end-to-end trainable neural model that directly predicts the set of cluster centers from the input text sequence at test time. Our experiments show that the per-sentence codebook embeddings significantly improve performance on unsupervised sentence similarity and extractive summarization benchmarks. In phrase similarity experiments, we discover that the multi-facet embeddings provide an interpretable semantic representation but do not outperform the single-facet baseline.
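To make the "cluster centers" view concrete, here is a minimal sketch that summarizes a handful of co-occurring word vectors with a few centers using plain k-means. This is an illustration only, not the paper's method: the abstract describes a neural model that predicts the centers directly from the input text, whereas this sketch clusters given vectors. The 2-D points, the `kmeans` helper, and the first-k initialization are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means; returns k centers summarizing `points`.

    Assumption for reproducibility: the first k points seed the centers.
    """
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (skip empty clusters).
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Hypothetical 2-D "embeddings" of words that co-occur with one sentence.
# They form two well-separated groups, i.e. two semantic facets.
toy_points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]

centers = sorted(kmeans(toy_points, k=2))
print(centers)
```

Each returned center plays the role of one facet of the sequence; a single averaged vector would instead land between the two groups and describe neither.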


Related content

Word vector representations

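A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers observed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Word embedding methods of this kind can be trained directly on large unlabeled corpora. The quality of the embeddings also depends heavily on the choice of context window size: a large window typically yields embeddings that reflect topical information, while a small window yields embeddings that reflect a word's function and its contextual semantics.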
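The analogy relation can be checked with simple vector arithmetic. The sketch below is a minimal illustration under stated assumptions: the three-dimensional vectors are hand-made toy values (not trained embeddings), and `analogy` is a hypothetical helper that returns the vocabulary word whose vector is closest, by cosine similarity, to b − a + c.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is common practice.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors (assumed values): axis 0 ~ royalty, axis 1 ~ gender, axis 2 ~ fruit.
toy_vectors = {
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

answer = analogy("man", "woman", "king", toy_vectors)
print(answer)
```

With trained embeddings the offset only holds approximately, which is why the nearest neighbor of the target vector is taken rather than an exact match.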
