Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets

The online sharing and viewing of Child Sexual Abuse Material (CSAM) are growing so fast that human experts can no longer keep up with manual inspection. However, the automatic classification of CSAM is a challenging field of research, largely due to the inaccessibility of target data that is, and should forever remain, private and in the sole possession of law enforcement agencies. To aid researchers in drawing insights from unseen data and to safely provide further understanding of CSAM images, we propose an analysis template that goes beyond the statistics of the dataset and its labels. It focuses on the extraction of automatic signals, provided both by pre-trained machine learning models (e.g., object categories and pornography detection) and by image metrics such as luminance and sharpness. Only aggregated statistics of sparse signals are provided, to guarantee the anonymity of the victimized children and adolescents. The pipeline allows filtering the data by applying thresholds to each specified signal, and it provides the distribution of such signals within the resulting subset, the correlations between signals, and a bias evaluation. We demonstrate our proposal on the Region-based annotated Child Pornography Dataset (RCPD), one of the few CSAM benchmarks in the literature, comprising over 2000 samples of regular and CSAM images and produced in partnership with Brazil's Federal Police. Although automatic signals are noisy and limited in several senses, we argue that they can highlight important aspects of the overall distribution of the data, which is valuable for databases that cannot be disclosed. Our goal is to safely publicize the characteristics of CSAM datasets, encouraging researchers to join the field and perhaps other institutions to provide similar reports on their benchmarks.
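As an illustration only, the filtering and aggregation step described above can be sketched as follows. This is a minimal sketch and not the authors' implementation: the signal names (`luminance`, `sharpness`, `nsfw_score`), the assumption that each signal is already a per-image number in [0, 1], and every function name are hypothetical. The point it shows is that only counts, histograms, means, and correlations leave the function, never an individual record.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

# One record per image: hypothetical signal name -> numeric value in [0, 1].
Record = Dict[str, float]


def filter_by_thresholds(records: List[Record],
                         thresholds: Dict[str, Tuple[float, float]]) -> List[Record]:
    """Keep records whose signals fall inside every given [low, high] range."""
    return [r for r in records
            if all(low <= r[name] <= high
                   for name, (low, high) in thresholds.items())]


def histogram(values: List[float], bins: int,
              low: float = 0.0, high: float = 1.0) -> List[int]:
    """Count values per equal-width bin; out-of-range values are clamped."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        idx = int((v - low) / width)
        counts[max(0, min(idx, bins - 1))] += 1
    return counts


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; NaN when either signal is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def summarize(records: List[Record],
              thresholds: Dict[str, Tuple[float, float]],
              bins: int = 4) -> dict:
    """Return aggregated statistics of the filtered subset only."""
    subset = filter_by_thresholds(records, thresholds)
    if not subset:
        return {"count": 0, "signals": {}, "correlations": {}}
    names = sorted(subset[0])
    columns = {n: [r[n] for r in subset] for n in names}
    return {
        "count": len(subset),
        "signals": {n: {"mean": sum(v) / len(v),
                        "histogram": histogram(v, bins)}
                    for n, v in columns.items()},
        "correlations": {(a, b): pearson(columns[a], columns[b])
                         for a, b in combinations(names, 2)},
    }
```

A call such as `summarize(records, {"nsfw_score": (0.5, 1.0)})` would restrict the report to images whose hypothetical detector score is at least 0.5. A real deployment would likely also suppress bins with very few samples, which this sketch does not do.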
