Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, together with a pre-training strategy that uses soft inductive biases in place of hard token boundaries. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by at least 1 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
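The two input-side ideas in the abstract, vocabulary-free character input and downsampling, can be illustrated with a minimal sketch. This is not the CANINE implementation: the function names `to_codepoints` and `downsample`, the rate of 4, and the mean-pooling step are all assumptions chosen for illustration, since the abstract does not specify the downsampling mechanism. The deep transformer stack is omitted entirely.

```python
from typing import List


def to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode code point; no vocabulary lookup is needed."""
    return [ord(ch) for ch in text]


def downsample(values: List[float], rate: int) -> List[float]:
    """Average each consecutive window of `rate` positions.

    Mean pooling is a hypothetical stand-in for whatever learned
    downsampling the model actually uses.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled = []
    for start in range(0, len(values), rate):
        window = values[start:start + rate]  # the last window may be shorter
        pooled.append(sum(window) / len(window))
    return pooled


text = "tokenization-free"
ids = to_codepoints(text)  # one integer per character
pooled = downsample([float(i) for i in ids], rate=4)  # assumed rate of 4
print(len(ids), len(pooled))  # prints "17 5"
```

The point of the sketch is the length reduction: 17 character positions become 5 pooled positions, so a deep stack applied after downsampling would process a sequence roughly `rate` times shorter than the raw character input.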