The primary obstacle to developing technologies for low-resource languages is the lack of representative, usable data. In this paper, we report the deployment of technology-driven data collection methods for creating a corpus of more than 60,000 translations from Hindi to Gondi, a low-resource, vulnerable language spoken by around 2.3 million tribal people in south and central India. During this process, we help expand information access in Gondi along two dimensions: (a) the creation of linguistic resources that can be used by the community, such as a dictionary, children's stories, Gondi translations from multiple sources, and an Interactive Voice Response (IVR) based mass awareness platform; and (b) enabling its use in the digital domain by developing a Hindi-Gondi machine translation model, which is compressed by nearly 4 times to enable its deployment on low-resource edge devices and in areas with little to no internet connectivity. We also present preliminary evaluations of using the developed machine translation model to assist volunteers who are collecting more data for the target language. Through these interventions, we not only created a refined and evaluated corpus of 26,240 Hindi-Gondi translations that was used for building the translation model, but also engaged nearly 850 community members who can help take Gondi onto the internet.