This work explores the capacity of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC), with a strong focus on the limits of such approaches in handling productive UGC phenomena which, almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact of various UGC phenomena on translation performance, using a small annotated dataset we developed, and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple yet insightful copy-task experiment and highlight the importance of reducing the vocabulary-size hyper-parameter to increase the robustness of character-based models for machine translation.
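To make the role of the vocabulary-size hyper-parameter concrete, the following is a minimal sketch of a capped character vocabulary, not the authors' implementation. The names `build_char_vocab`, `encode`, `unk_rate`, the `<unk>` symbol, and the toy sentences are all hypothetical and chosen for illustration only. It shows that a character absent from the training data (here an emoji) can only be mapped to the unknown symbol, and that capping the vocabulary makes that symbol appear in the training data itself. Whether this is the mechanism behind the robustness gain is an assumption; the abstract does not spell it out.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical unknown-character symbol, always given id 0


def build_char_vocab(corpus, max_size):
    """Keep the max_size most frequent characters; all others fall back to UNK."""
    counts = Counter(ch for line in corpus for ch in line)
    kept = [ch for ch, _ in counts.most_common(max_size)]
    vocab = {UNK: 0}
    for index, ch in enumerate(kept, start=1):
        vocab[ch] = index
    return vocab


def encode(text, vocab):
    """Map each character to its id, using the UNK id for unseen characters."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]


def unk_rate(text, vocab):
    """Fraction of characters in text that are encoded as UNK."""
    ids = encode(text, vocab)
    if not ids:
        return 0.0
    return sum(i == vocab[UNK] for i in ids) / len(ids)


# Toy "clean" training data and a noisy UGC-like test sentence (both made up).
train = ["this is clean text", "another clean sentence"]
noisy = "this is sooo cool 😍"

large_vocab = build_char_vocab(train, max_size=50)  # every training character kept
small_vocab = build_char_vocab(train, max_size=3)   # most training characters dropped

if __name__ == "__main__":
    print("UNK rate on training text, large vocab:", unk_rate(train[0], large_vocab))
    print("UNK rate on training text, small vocab:", unk_rate(train[0], small_vocab))
    print("UNK rate on noisy text, large vocab:", unk_rate(noisy, large_vocab))
```

With the large vocabulary the unknown symbol never occurs in the training sentences, so a model trained on them would meet it for the first time on noisy input. With the small vocabulary it occurs during training as well.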