In the last decade, machine translation has become a popular means of dealing with multilingual digital content. As translation quality improves, using machine translation to obfuscate the source language of a text becomes more attractive. In this paper, we analyze how well the source language can be detected from the translated output of two widely used commercial machine translation systems, using machine-learning algorithms with basic textual features such as n-grams. Evaluations show that the source language can be reconstructed with high accuracy for documents that contain a sufficient amount of translated text. In addition, we analyze how document size influences prediction performance and how limiting the set of possible source languages improves classification accuracy.
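To make the general idea concrete, the following is a minimal sketch of source-language classification from character n-grams. It is not the method evaluated in this paper: the abstract names neither the classifier nor the exact features, so the multinomial naive Bayes model, the choice of character trigrams, the class `NgramNaiveBayes`, the helper `char_ngrams`, the labels `src_a` and `src_b`, and the training sentences are all assumptions invented for illustration. The sentences are placeholders, not real machine translation output.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = " ".join(text.lower().split())  # normalize whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-gram counts.

    Hypothetical illustration; the classifier choice is an assumption.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text, candidates=None):
        """Return the most likely label, optionally limited to `candidates`."""
        pool = candidates if candidates is not None else self.docs
        labels = sorted(label for label in pool if label in self.docs)
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs[label] for label in labels)
        best_label, best_score = None, -math.inf
        for label in labels:
            total = sum(self.counts[label].values())
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / total_docs)
            for gram in grams:
                count = self.counts[label][gram]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder sentences standing in for translated text.
train_texts = [
    "it makes itself that the question poses itself",
    "the question poses itself again and it makes itself clear",
    "we are agreeing with regard to what concerns the budget",
    "with regard to what concerns the plan we are agreeing",
]
train_labels = ["src_a", "src_a", "src_b", "src_b"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("the question poses itself"))
```

The `candidates` argument mirrors the restricted setting mentioned in the abstract, where only a subset of source languages is considered possible. Longer inputs contribute more n-grams to the score, which is one intuitive reason why document size would affect prediction performance.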