Disaggregated performance metrics across demographic groups are a hallmark of fairness assessments in computer vision. These metrics have successfully incentivized performance improvements on person-centric tasks such as face analysis, and they are used to understand the risks of modern models. However, there has been little discussion of the vulnerabilities of these measurements for more complex computer vision tasks. In this paper, we consider multi-label image classification and, specifically, object categorization tasks. First, we highlight design choices and trade-offs for measurement that involve more nuance than is discussed in prior computer vision literature. These challenges relate to the necessary scale of data, the definition of groups for images, the choice of metric, and dataset imbalances. Next, through two case studies using modern vision models, we demonstrate that naive implementations of these assessments are brittle. We identify several design choices that look like mere implementation details but significantly impact the conclusions of assessments, in both the magnitude and the direction (i.e., which group the classifiers perform best on) of disparities. Based on ablation studies, we propose recommendations to increase the reliability of these assessments. Finally, through a qualitative analysis we find that concepts with large disparities tend to have varying definitions and representations between groups, with inconsistencies across datasets and annotators. While this result suggests avenues for mitigation through more consistent data collection, it also highlights that ambiguous label definitions remain a challenge when performing model assessments. As vision models expand and become more ubiquitous, it is all the more important that our disparity assessments accurately reflect the true performance of models.
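To make the notion of a disaggregated assessment concrete, the following is a minimal illustrative sketch (not the paper's implementation) of computing per-concept recall separately for each demographic group in a multi-label setting; all array names and the toy grouping are hypothetical placeholders.

```python
# Illustrative sketch: group-disaggregated per-concept recall for a
# multi-label classifier. Names and data are hypothetical.
import numpy as np

def disaggregated_recall(y_true, y_pred, groups):
    """Compute per-concept recall separately for each demographic group.

    y_true, y_pred: (n_images, n_concepts) binary arrays.
    groups: (n_images,) array of group labels, one per image.
    Returns {group: per-concept recall array (NaN where a group has no positives)}.
    """
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        tp = np.logical_and(y_true[mask], y_pred[mask]).sum(axis=0).astype(float)
        positives = y_true[mask].sum(axis=0).astype(float)
        with np.errstate(divide="ignore", invalid="ignore"):
            results[g] = np.where(positives > 0, tp / positives, np.nan)
    return results

# Toy usage: 6 images, 3 concepts, two groups "A" and "B".
y_true = np.array([[1,0,1],[0,1,0],[1,1,0],[1,0,0],[0,1,1],[1,0,1]])
y_pred = np.array([[1,0,0],[0,1,0],[1,0,0],[0,0,0],[0,1,1],[1,0,1]])
groups = np.array(["A","A","A","B","B","B"])

per_group = disaggregated_recall(y_true, y_pred, groups)
for g, rec in per_group.items():
    print(g, np.round(rec, 2))
# A disparity report would then compare recall across groups per concept
# (e.g., the max-min gap); NaNs from groups with no positive examples must
# be handled explicitly, one of the imbalance issues the abstract raises.
```

Even this simple sketch exposes the design choices the paper discusses: which metric to disaggregate, how groups are assigned to images, and how to treat concepts with too few positives in a group.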