Data scarcity lies at the center of the issues holding back Indonesian natural language processing (NLP) research. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their datasets. Furthermore, the few public datasets that exist are scattered across different platforms, which makes reproducible and data-centric research in Indonesian NLP even more arduous. Rising to this challenge, we initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd strives to provide the largest aggregation of datasheets with standardized data loading for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia and encourage NLP practitioners to move towards collaboration.