We present a subword regularization method for WordPiece, which tokenizes text with a maximum matching algorithm. The proposed method, MaxMatch-Dropout, randomly drops vocabulary words during the maximum matching search, so that the same input can be segmented in different ways. This enables fine-tuning with subword regularization for popular pretrained language models such as BERT-base. Experimental results demonstrate that MaxMatch-Dropout improves performance on text classification and machine translation tasks, to a degree comparable to other subword regularization methods. Moreover, we provide a comparative analysis of three subword regularization methods: subword regularization with SentencePiece (Unigram), BPE-Dropout, and MaxMatch-Dropout.
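To make the idea concrete, the following is a minimal sketch of dropout applied to greedy longest-match-first tokenization, not the authors' implementation. The function name `maxmatch_dropout`, the toy vocabulary, the `##` continuation prefix, and the `[UNK]` fallback are illustrative assumptions. The sketch also assumes that single-character candidates are never dropped, so that a word that can be tokenized without dropout can still be tokenized with it.

```python
import random

# Toy vocabulary for illustration only (hypothetical, not a real WordPiece vocab).
TOY_VOCAB = {"un", "##able", "u", "##n", "##a", "##b", "##l", "##e"}


def maxmatch_dropout(word, vocab, p=0.1, rng=None, unk="[UNK]"):
    """Tokenize one word by longest-match-first search with random dropping.

    At each position, candidates are tried from longest to shortest. A
    candidate found in the vocabulary is skipped with probability p, which
    forces the search to fall back to a shorter match.
    """
    rng = rng or random
    tokens = []
    start = 0
    while start < len(word):
        match = None
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal pieces
            if piece not in vocab:
                continue
            # Assumption: single characters are always kept.
            is_single_char = (end - start) == 1
            if not is_single_char and rng.random() < p:
                continue  # drop this candidate and try a shorter one
            match = (piece, end)
            break
        if match is None:
            return [unk]  # no usable piece at this position
        tokens.append(match[0])
        start = match[1]
    return tokens


# p=0.0 reduces to plain maximum matching.
print(maxmatch_dropout("unable", TOY_VOCAB, p=0.0))  # ['un', '##able']
# p=1.0 drops every multi-character match, leaving single characters.
print(maxmatch_dropout("unable", TOY_VOCAB, p=1.0))
```

With an intermediate `p`, repeated calls on the same word yield different segmentations, which is the source of the regularization effect during fine-tuning.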