Language representations are an efficient tool used across NLP, but they are rife with encoded societal biases. These biases have been studied extensively, but primarily for English language representations and for biases common in Western society. In this work, we investigate biases present in Hindi language representations, such as caste- and religion-associated biases. We demonstrate how biases are unique to specific language representations, shaped by the history and culture of the regions where the language is widely spoken, and how the same societal bias (such as binary gender-associated bias), when investigated across languages, is encoded by different words and text spans. With this work, we emphasize the necessity of social awareness, alongside linguistic and grammatical artefacts, when modeling language representations in order to understand the biases encoded.