A machine translation (MT) system aims to translate text from a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared to those for high-resource languages such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora using Linguistic Inquiry and Word Count (LIWC) and several related experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features based on a dictionary. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Through a correlation analysis between LIWC features and NMT performance, our findings suggest directions for further research toward obtaining higher-quality parallel corpora.
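As a rough illustration of the two ideas named above, dictionary-based word counting and correlation analysis, the following minimal Python sketch may help. It is not LIWC itself and not the procedure used in this study: the dictionary `TOY_DICTIONARY`, its categories, and the function names `category_rates` and `pearson` are all hypothetical placeholders, and the real LIWC dictionary is proprietary and far larger.

```python
from collections import Counter
import math
import re

# Hypothetical toy dictionary: category -> words. These entries are
# placeholders for illustration only, not actual LIWC categories.
TOY_DICTIONARY = {
    "pronoun": {"i", "you", "we", "they"},
    "negation": {"no", "not", "never"},
}


def category_rates(text, dictionary=TOY_DICTIONARY):
    """Return the fraction of tokens that fall into each dictionary category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {category: counts[category] / total for category in dictionary}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, one could compute `category_rates` for each corpus, collect the rate of a single category across corpora, and pass that list together with a per-corpus translation quality score to `pearson` to see how strongly the two vary together.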