Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8, obviating the need for an embedding layer since there are fewer token types (256) than dimensions. Surprisingly, replacing the ubiquitous embedding layer with one-hot representations of each byte does not hurt performance; experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models. A deeper investigation reveals that the combination of embeddingless models with decoder-input dropout amounts to token dropout, which benefits byte-to-byte models in particular.
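As a rough illustration of the embeddingless input described above, the sketch below shows how a string can be encoded as UTF-8 bytes and mapped to zero-padded one-hot vectors in place of a learned embedding lookup. This is a minimal sketch assuming a PyTorch setup; the function name byte_one_hot and the d_model value are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def byte_one_hot(text: str, d_model: int = 512) -> torch.Tensor:
    """Encode text as UTF-8 bytes and map each byte to a one-hot vector.

    Since there are only 256 byte values, the one-hot vectors fit inside
    the model dimension and can be zero-padded to d_model and fed to the
    encoder directly, replacing a learned embedding lookup.
    """
    byte_ids = torch.tensor(list(text.encode("utf-8")), dtype=torch.long)
    one_hot = F.one_hot(byte_ids, num_classes=256).float()  # (num_bytes, 256)
    # Pad the remaining d_model - 256 dimensions with zeros.
    return F.pad(one_hot, (0, d_model - 256))                # (num_bytes, d_model)

# Example: a short English sentence becomes a (num_bytes, d_model) matrix
# with exactly one nonzero entry per row.
x = byte_one_hot("Hello, world!")
print(x.shape)        # torch.Size([13, 512])
print(x.sum(dim=-1))  # all ones: each row is one-hot
```

In this sketch the decoder-input (token) dropout mentioned above would simply zero out entire byte rows at training time rather than individual embedding dimensions.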