While artificial intelligence provides the backbone for many tools people use around the world, recent work has drawn attention to the fact that the algorithms powering AI are not free of politics, stereotypes, and bias. Most work in this area has focused on the ways in which AI can exacerbate existing inequalities and discrimination, but very little has studied how governments actively shape training data. We describe how censorship has affected the development of Wikipedia corpora, text data that are regularly used as pre-training inputs for NLP algorithms. We show that word embeddings trained on Baidu Baike, an online Chinese encyclopedia, exhibit markedly different associations between adjectives and a range of concepts about democracy, freedom, collective action, equality, and people and historical events in China from embeddings trained on its regularly blocked but uncensored counterpart, Chinese-language Wikipedia. We examine the implications of these discrepancies by studying their effects in downstream AI applications. Our paper shows how government repression, censorship, and self-censorship may impact training data and the applications that draw from them.
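To make the comparison concrete, the following is a minimal sketch of how adjective-concept associations in two separately trained embedding spaces might be measured, using gensim's word2vec implementation. The corpus variables, example tokens, and hyperparameters are illustrative assumptions, not the paper's actual pipeline or word lists.

```python
# Sketch (not the authors' pipeline): train word2vec on each corpus and
# compare the cosine similarity of an adjective to a target concept.
from gensim.models import Word2Vec


def train_embeddings(tokenized_sentences):
    """Train a word2vec model on a pre-tokenized corpus.

    tokenized_sentences: list of token lists, e.g. produced by running a
    Chinese word segmenter such as jieba over a corpus dump (assumed).
    """
    return Word2Vec(
        sentences=tokenized_sentences,
        vector_size=300,   # embedding dimension (illustrative)
        window=5,          # context window size
        min_count=10,      # drop rare tokens
        workers=4,
    )


def association(model, adjective, concept):
    """Cosine similarity between an adjective and a concept word."""
    return model.wv.similarity(adjective, concept)


# Hypothetical usage, assuming wiki_sents and baike_sents hold the
# segmented Chinese Wikipedia and Baidu Baike corpora respectively:
# wiki_model = train_embeddings(wiki_sents)
# baike_model = train_embeddings(baike_sents)
# gap = (association(wiki_model, "自由", "民主")
#        - association(baike_model, "自由", "民主"))
# A large |gap| flags a concept whose adjective associations diverge
# between the two training corpora.
```

In this setup, any systematic difference in the similarity scores across the two models reflects differences in the underlying training text rather than in the algorithm, which is held fixed.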