We develop machine translation and speech synthesis systems to complement efforts to revitalize Judeo-Spanish, the exiled language of Sephardic Jews, which survived for centuries but now faces the threat of extinction in the digital age. Building on resources created by the Sephardic community of Turkey and elsewhere, we create corpora and tools to help preserve the language for future generations. For machine translation, we first develop a rule-based Spanish-to-Judeo-Spanish machine translation system and use it to generate large volumes of synthetic parallel data pairing Judeo-Spanish with each of the relevant languages: Turkish, English and Spanish. We then train baseline neural machine translation engines on this synthetic data together with authentic parallel data created from translations by the Sephardic community. For text-to-speech synthesis, we present a 3.5-hour single-speaker speech corpus for building a neural speech synthesis engine. Resources, model weights and online inference engines are shared publicly.