Understanding the Representation and Representativeness of Age in AI Data Sets

A diverse representation of different demographic groups in AI training data sets is important for ensuring that models work for a wide range of users. To this end, recent efforts in AI fairness and inclusion have advocated for creating AI data sets that are well-balanced across race, gender, socioeconomic status, and disability status. In this paper, we contribute to this line of work by focusing on the representation of age, asking whether older adults are represented in AI data sets in proportion to the population at large. As a case study, we examine publicly available information about 92 face data sets to understand how they codify age, investigating how subjects' ages are recorded and whether older generations are represented. We find that older adults are heavily under-represented: five data sets in the study that explicitly documented the closed age intervals of their subjects included older adults (defined as older than 65 years), while only one included oldest-old adults (defined as older than 85 years). Additionally, we find that only 24 of the data sets include any age-related information in their documentation or metadata, and that these data sets follow no consistent method for collecting and recording subjects' ages. We recognize the unique difficulties of creating data sets that are representative in terms of age, but raise it as an important dimension that researchers and engineers interested in inclusive AI should consider.
