Encoder-decoder models typically restrict their vocabulary to words that occur frequently in the training corpus, both to reduce computational cost and to exclude noise. However, this vocabulary may still include words that interfere with learning in encoder-decoder models. This paper proposes a method for selecting words that are more suitable for training encoders by using not only frequency but also co-occurrence information, which we capture with the HITS algorithm. We apply the proposed method to two tasks: machine translation and grammatical error correction. For Japanese-to-English translation, the method achieves a BLEU score 0.56 points higher than that of a baseline. It also outperforms the baseline on English grammatical error correction, with an F0.5-measure 1.48 points higher.
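To illustrate the general idea, the sketch below runs HITS over a word co-occurrence graph and keeps the highest-scoring words as the vocabulary. This is only a minimal, assumed reconstruction: the graph construction (linking words that co-occur in a sentence), the use of authority scores as the ranking criterion, and the `hits_vocabulary` function name are illustrative choices, not the paper's exact procedure.

```python
from collections import defaultdict
from itertools import combinations

def hits_vocabulary(sentences, vocab_size, iterations=50):
    """Rank words by HITS authority scores over a word co-occurrence graph
    and return the top `vocab_size` words as the vocabulary.
    (Illustrative sketch; not the paper's exact selection procedure.)"""
    # Build a symmetric co-occurrence graph: words that co-occur in a
    # sentence link to each other in both directions.
    links = defaultdict(set)
    for sent in sentences:
        for u, v in combinations(set(sent), 2):
            links[u].add(v)
            links[v].add(u)

    words = list(links)
    hub = {w: 1.0 for w in words}
    auth = {w: 1.0 for w in words}

    for _ in range(iterations):
        # Authority score: sum of hub scores of words pointing to it.
        auth = {w: sum(hub[u] for u in links[w]) for w in words}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {w: a / norm for w, a in auth.items()}

        # Hub score: sum of authority scores of words it points to.
        hub = {w: sum(auth[v] for v in links[w]) for w in words}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {w: h / norm for w, h in hub.items()}

    ranked = sorted(auth.items(), key=lambda x: -x[1])
    return [w for w, _ in ranked[:vocab_size]]

# Toy usage: select a 5-word vocabulary from tokenized sentences.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]
print(hits_vocabulary(corpus, vocab_size=5))
```

In contrast to a purely frequency-based cutoff, a ranking of this kind favors words that co-occur with many other well-connected words, which is the intuition behind using co-occurrence information for vocabulary selection.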