The distributed representation of symbols is one of the key technologies in machine learning systems today, playing a pivotal role in modern natural language processing. Traditional word embeddings associate a separate vector with each word. While this approach is simple and leads to good performance, it requires substantial memory to represent a large vocabulary. To reduce the memory footprint, the default embedding layer in spaCy is a hash embeddings layer: a stochastic approximation of traditional embeddings that provides unique vectors for a large number of words without explicitly storing a separate vector for each of them. To compute meaningful representations for both known and unknown words, hash embeddings represent each word as a summary of its normalized form, subword information, and word shape; together, these features produce a multi-embedding of the word. In this technical report we first lay out a bit of history and introduce the embedding methods in spaCy in detail. Second, we critically evaluate the hash embedding architecture with multi-embeddings on Named Entity Recognition datasets from a variety of domains and languages. The experiments validate most key design choices behind spaCy's embedders, but we also uncover a few surprising results.
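To make the mechanism concrete, below is a minimal NumPy sketch of a multi hash embedding. It is not spaCy's implementation: spaCy's MultiHashEmbed layer uses MurmurHash, a learned table per feature, and a Maxout mixing layer, whereas the table size, seeds, hash function, feature choices (lowercased form, one-character prefix, three-character suffix, shape), and the concatenation step here are all illustrative assumptions.

```python
import hashlib
import numpy as np

ROWS, DIM = 5000, 96          # illustrative table size and vector width
SEEDS = (0, 1, 2, 3)          # several independent hash functions

rng = np.random.default_rng(42)
table = rng.normal(0.0, 1.0 / np.sqrt(DIM), size=(ROWS, DIM))

def hash_str(value: str, seed: int) -> int:
    # Deterministic stand-in hash; spaCy itself uses MurmurHash.
    digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")

def embed_feature(value: str) -> np.ndarray:
    # Sum the table rows picked by each hash seed: a collision under one
    # seed is unlikely to repeat under all of them, so most strings get a
    # near-unique vector even though the table is small.
    return sum(table[hash_str(value, s) % ROWS] for s in SEEDS)

def shape(word: str) -> str:
    # Crude word-shape feature, e.g. "Panda2" -> "Xxxxxd".
    return "".join(
        "X" if c.isupper() else "x" if c.islower() else
        "d" if c.isdigit() else c
        for c in word
    )

def multi_embed(word: str) -> np.ndarray:
    # Combine orthographic views of the word: normalized form, prefix,
    # suffix, and shape. Unseen words still hash to rows of the table,
    # so they receive a representation too.
    features = [word.lower(), word[:1], word[-3:], shape(word)]
    return np.concatenate([embed_feature(f) for f in features])

vec = multi_embed("Amsterdam")   # 4 features x 96 dims = 384-dim vector
print(vec.shape)                 # (384,)
```

The stochastic character of the approximation comes from the multiple seeds: two words receive identical feature vectors only if they collide under every hash function, which becomes exponentially unlikely as seeds are added.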