Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is widely available only for high-resource languages such as English and Chinese, while it remains out of reach for many other languages because data resources and benchmarks are unavailable. In this work, we focus on developing resources for languages in Indonesia. Although Indonesia is the second most linguistically diverse country, most of its languages are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges of creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.