Disaggregated evaluations of AI systems, in which system performance is assessed and reported separately for different groups of people, are conceptually simple. However, their design involves a variety of choices. Some of these choices influence the results that will be obtained, and thus the conclusions that can be drawn; others influence the impacts -- both beneficial and harmful -- that a disaggregated evaluation will have on people, including the people whose data is used to conduct the evaluation. We argue that a deeper understanding of these choices will enable researchers and practitioners to design careful and conclusive disaggregated evaluations. We also argue that better documentation of these choices, along with the underlying considerations and tradeoffs that have been made, will help others when interpreting an evaluation's results and conclusions.