Most unsupervised NLP models represent each word with a single point or region in semantic space, while existing multi-sense word embeddings cannot represent longer word sequences such as phrases or sentences. We propose a novel embedding method for a text sequence (a phrase or a sentence) in which each sequence is represented by a distinct set of multi-mode codebook embeddings that capture different semantic facets of its meaning. The codebook embeddings can be viewed as cluster centers that summarize the distribution of possibly co-occurring words in a pre-trained word embedding space. We introduce an end-to-end trainable neural model that directly predicts this set of cluster centers from the input text sequence at test time. Our experiments show that the per-sentence codebook embeddings significantly improve performance on unsupervised sentence similarity and extractive summarization benchmarks. In phrase similarity experiments, we find that the multi-facet embeddings provide an interpretable semantic representation but do not outperform the single-facet baseline.
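To make the architecture concrete, below is a minimal sketch, not the authors' released code, of one plausible reading of the abstract: a small transformer encoder plus K learned query vectors that attend over the sequence to emit K facet embeddings, trained with a Chamfer-style set loss that pulls each co-occurring word's pre-trained vector toward its nearest facet. All class names, hyperparameters, and the exact loss are illustrative assumptions.

```python
# Illustrative sketch only: maps a token sequence to K "codebook" facet
# embeddings acting as cluster centers of co-occurring word vectors.
import torch
import torch.nn as nn

class MultiFacetEncoder(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, vocab_size: int, dim: int = 300, num_facets: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # K learned queries; each attends over the sequence to produce
        # one facet (cluster-center) embedding.
        self.facet_queries = nn.Parameter(torch.randn(num_facets, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=6, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> facets: (batch, K, dim)
        h = self.encoder(self.embed(token_ids))
        q = self.facet_queries.unsqueeze(0).expand(h.size(0), -1, -1)
        facets, _ = self.attn(q, h, h)
        return facets

def facet_reconstruction_loss(facets: torch.Tensor,
                              word_vecs: torch.Tensor) -> torch.Tensor:
    """Chamfer-style set loss (a stand-in for the paper's actual objective):
    match each co-occurring word's pre-trained vector to its nearest
    predicted facet, so the K facets behave like cluster centers."""
    # facets: (batch, K, dim); word_vecs: (batch, num_words, dim)
    dists = torch.cdist(word_vecs, facets)  # (batch, num_words, K)
    return dists.min(dim=-1).values.mean()

# Toy usage with random data standing in for real sentences and for the
# pre-trained embeddings of words that co-occur with each sentence.
model = MultiFacetEncoder(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 12))
targets = torch.randn(2, 8, 300)
loss = facet_reconstruction_loss(model(tokens), targets)
loss.backward()
```

Because the loss takes a minimum over facets per target word, different facets specialize to different regions of the co-occurring word distribution, which is one way to realize the "multiple semantic facets" the abstract describes.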