Research in machine learning (ML) has primarily argued that models trained on incomplete or biased datasets can produce discriminatory outputs. In this commentary, we propose moving the research focus beyond bias-oriented framings by adopting a power-aware perspective to "study up" ML datasets. This means accounting for the historical inequities, labor conditions, and epistemological standpoints inscribed in data. We draw on HCI and CSCW work to support our argument, critically analyze previous research, and point to two co-existing lines of work within our community -- one bias-oriented, the other power-aware. In this way, we highlight the need for dialogue and cooperation in three areas: data quality, data work, and data documentation. In the first area, we argue that reducing societal problems to "bias" misses the context-based nature of data. In the second, we highlight the corporate forces and market imperatives that shape the labor of data workers and, in turn, ML datasets. Finally, we propose expanding current transparency-oriented efforts in dataset documentation to reflect the social contexts of data design and production.