Word embedding techniques based on co-occurrence statistics have proved very useful for extracting semantic and syntactic representations of words as low-dimensional continuous vectors. In this work, we find that dictionary learning can decompose these word vectors into linear combinations of more elementary word factors. We demonstrate that many of the learned factors have surprisingly strong semantic or syntactic meaning, corresponding to factors previously identified manually by human inspection. Dictionary learning thus provides a powerful visualization tool for understanding word embedding representations. Furthermore, we show that the word factors help identify key semantic and syntactic differences in word analogy tasks and improve upon state-of-the-art word embedding techniques in these tasks by a large margin.
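To make the phrase "linear combination of more elementary word factors" concrete, the decomposition can be sketched with a generic sparse coding objective. This is an illustrative formulation under stated assumptions, not necessarily the exact objective used in this work: the symbols $x_i$, $\Phi$, $\alpha_i$, $K$, $\lambda$ and the non-negativity constraint are all assumptions introduced here.

```latex
% Assumed notation (not from the abstract):
%   x_i      in R^d        : embedding of word i
%   \Phi     in R^{d x K}  : dictionary whose columns \phi_k are the word factors
%   \alpha_i in R^K        : sparse coefficients of word i
%   \lambda  > 0           : sparsity weight
\begin{align}
  x_i &\approx \Phi \alpha_i = \sum_{k=1}^{K} \alpha_{i,k}\, \phi_k, \\
  \min_{\Phi,\, \{\alpha_i\}} \;& \sum_{i} \tfrac{1}{2} \left\lVert x_i - \Phi \alpha_i \right\rVert_2^2
      + \lambda \left\lVert \alpha_i \right\rVert_1
  \quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \lVert \phi_k \rVert_2 \le 1 .
\end{align}
```

Under this reading, a word factor $\phi_k$ can be inspected by listing the words with the largest coefficient $\alpha_{i,k}$, and a word can be visualized by the few factors on which its $\alpha_i$ is nonzero.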