Text error correction aims to correct errors in text sequences, such as those typed by humans or generated by speech recognition models. Previous error correction methods usually take the source (incorrect) sentence as encoder input and generate the target (correct) sentence through the decoder. Since the error rate of the incorrect sentence is usually low (e.g., 10\%), the correction model can only learn to correct the limited error tokens while trivially copying the majority of (correct) tokens, which harms the effective training of error correction. In this paper, we argue that the correct tokens should be better utilized to facilitate effective training, and we propose a simple yet effective masking strategy to achieve this goal. Specifically, we randomly mask out some of the correct tokens in the source sentence and let the model learn not only to correct the original error tokens but also to predict the masked tokens based on their contextual information. Our method enjoys several advantages: 1) it alleviates trivial copying; 2) it leverages effective training signals from correct tokens; 3) it is a plug-and-play module that can be applied to different models and tasks. Experiments on spelling error correction and speech recognition error correction on Mandarin datasets, and on grammar error correction on English datasets, with both autoregressive and non-autoregressive generation models, show that our method consistently improves correction accuracy.
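To make the masking strategy concrete, below is a minimal sketch for the equal-length case (e.g., spelling correction), where a source token is treated as "correct" when it matches the target token at the same position. The MASK_TOKEN symbol, the positional notion of correctness, and the mask_prob value are illustrative assumptions, not the paper's exact implementation.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol (BERT-style); an assumption, not the paper's choice

def mask_correct_tokens(src_tokens, tgt_tokens, mask_prob=0.15, rng=random):
    """Randomly replace some *correct* source tokens with MASK_TOKEN.

    Assumes source and target have equal length (as in spelling
    correction), so "correct" means the source token equals the target
    token at the same position. Error tokens are never masked: the
    model must still see and correct them. Masked correct tokens give
    the model an extra training signal (predict them from context)
    instead of a trivial copy.
    """
    masked = []
    for s, t in zip(src_tokens, tgt_tokens):
        if s == t and rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide a correct token; predict it from context
        else:
            masked.append(s)           # keep error tokens (and unmasked correct ones) as-is
    return masked

# Example: "器" is a spelling error for "气"; only the correct tokens may be masked.
src = list("今天天器很好")
tgt = list("今天天气很好")
print(mask_correct_tokens(src, tgt, mask_prob=0.5))
```

The masked source is then fed to the encoder, and the decoder is trained to output the full correct target, so the loss on masked positions comes from context-based prediction rather than copying.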