Jejueo was classified as critically endangered by UNESCO in 2010. Although diverse efforts have been made to revitalize it, few of them have taken a computational approach. Motivated by this, we construct two new Jejueo datasets: Jejueo Interview Transcripts (JIT) and Jejueo Single Speaker Speech (JSS). The JIT dataset is a parallel corpus containing more than 170k Jejueo-Korean sentences, and the JSS dataset consists of 10k high-quality audio files recorded by a native Jejueo speaker, together with a transcript file. Using these datasets, we build neural machine translation and speech synthesis systems. All resources are publicly available via our GitHub repository. We hope that these datasets will attract the interest of both the language and machine learning communities.