Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content?

Large datasets underlying much of current machine learning raise serious issues concerning inappropriate content, that is, content that is offensive, insulting, or threatening, or that might otherwise cause anxiety. This calls for increased dataset documentation, e.g., using datasheets. Among other topics, datasheets encourage reflection on the composition of a dataset. So far, however, this documentation is done manually and can therefore be tedious and error-prone, especially for large image datasets. Here we ask the arguably "circular" question of whether a machine can help us reflect on inappropriate content, answering Question 16 in Datasheets. To this end, we propose to use the information stored in pre-trained transformer models to assist us in the documentation process. Specifically, prompt-tuning based on a dataset of socio-moral values steers CLIP to identify potentially inappropriate content, thereby reducing human labor. We then document the inappropriate images found using word clouds, based on captions generated using a vision-language model. The documentation of two popular, large-scale computer vision datasets -- ImageNet and OpenImages -- produced this way suggests that machines can indeed help dataset creators answer Question 16 on inappropriate image content.
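The two-stage pipeline described in the abstract (flag candidate images by comparing image embeddings with prompt embeddings, then summarize captions of the flagged images as word counts for a word cloud) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are hard-coded toy vectors standing in for CLIP outputs, the prompt vectors stand in for prompts that the paper learns by prompt-tuning, and the function names, the `temperature` value, the 0.5 threshold, and the stopword list are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def flag_inappropriate(image_embs, prompt_embs, temperature=100.0, threshold=0.5):
    """Return indices of images whose softmax score for the
    'inappropriate' prompt exceeds `threshold`.

    `prompt_embs` maps the labels 'appropriate' and 'inappropriate' to
    embedding vectors (assumed here; in the paper the prompts are tuned).
    """
    flagged = []
    for i, emb in enumerate(image_embs):
        logits = {k: temperature * cosine(emb, p) for k, p in prompt_embs.items()}
        top = max(logits.values())  # subtract the max for numerical stability
        exps = {k: math.exp(v - top) for k, v in logits.items()}
        prob = exps["inappropriate"] / sum(exps.values())
        if prob > threshold:
            flagged.append(i)
    return flagged


def word_frequencies(captions, stopwords=frozenset({"a", "the", "of", "on", "in", "and", "with"})):
    """Count non-stopword tokens across captions (the input to a word cloud)."""
    counts = Counter()
    for caption in captions:
        counts.update(w for w in caption.lower().split() if w not in stopwords)
    return counts


# Toy stand-ins for CLIP embeddings and generated captions (hypothetical data).
prompt_embs = {"appropriate": [1.0, 0.0], "inappropriate": [0.0, 1.0]}
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
captions = ["a dog on the grass", "a man holding a gun", "a gun on a table"]

flagged = flag_inappropriate(image_embs, prompt_embs)
freqs = word_frequencies(captions[i] for i in flagged)
print(flagged, freqs.most_common(3))
```

In a real setting the toy vectors would be replaced by embeddings from a pre-trained vision-language model, and the flagged images would still need human review; the sketch only shows how the flagging step narrows down what a person has to look at.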


Related content

Dataset

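A dataset, also called a data set or data collection, is a collection made up of data.
A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set and lists that member's value for each variable, such as the height and weight of an object or the value of a random number. Each value is known as a datum. The data set may contain data for one or more members, matching the number of rows.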
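The tabular view described above can be sketched in a few lines. The member names, variables, and values below are hypothetical and chosen only to illustrate rows as members and columns as variables.

```python
# Each row (dict) is one member of the data set; each key is a variable (column).
dataset = [
    {"member": "A", "height_cm": 170.0, "weight_kg": 65.0},
    {"member": "B", "height_cm": 182.5, "weight_kg": 80.0},
    {"member": "C", "height_cm": 158.0, "weight_kg": 52.5},
]


def column(rows, variable):
    """Return the values of one variable across all members."""
    return [row[variable] for row in rows]


heights = column(dataset, "height_cm")
mean_height = sum(heights) / len(heights)
print(len(dataset), heights, mean_height)
```

Here a single value such as `dataset[0]["height_cm"]` is one datum, and `len(dataset)` is the number of members.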