Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent large swathes of the global population, who tend to be among the most vulnerable and marginalised people in society and often live in rural developing areas. Such underrepresented groups are thus not only ignored when modeling and system design decisions are made, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the underrepresentation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.