Theories of "Sexuality" in Natural Language Processing Bias Research

From arXiv: 17 pages, 6 tables, 1 figure. Undergraduate senior thesis, submitted to The Spectra: The Virginia Engineering and Science Research Journal.

In recent years, significant advancements in the field of Natural Language Processing (NLP) have positioned commercialized language models as wide-reaching, highly useful tools. In tandem, there has been an explosion of multidisciplinary research examining how NLP tasks reflect, perpetuate, and amplify social biases such as gender and racial bias. A significant gap in this scholarship is a detailed analysis of how queer sexualities are encoded and (mis)represented by both NLP systems and practitioners. Following previous work in the field of AI fairness, we document how sexuality is defined and operationalized via a survey and analysis of 55 articles that quantify sexuality-based NLP bias. We find that sexuality is not clearly defined in a majority of the literature surveyed, indicating a reliance on assumed or normative conceptions of sexual/romantic practices and identities. Further, we find that methods for extracting biased outputs from NLP technologies often conflate gender and sexual identities, leading to monolithic conceptions of queerness and thus improper quantifications of bias. With the goal of improving sexuality-based NLP bias analyses, we conclude with recommendations that encourage more thorough engagement with both queer communities and interdisciplinary literature.
