Child Sexual Abuse Material (CSAM) is any visual record of sexually explicit activity involving minors. CSAM harms victims beyond the abuse itself because distribution never ends and the images are permanent. Machine learning can help law enforcement quickly identify CSAM and block its digital distribution. However, collecting CSAM imagery to train machine learning models is subject to strict ethical and legal constraints, creating a barrier to research and development. Under such restrictions, CSAM detection systems based on file metadata open several opportunities: metadata is not a record of a crime and carries no legal restrictions. Investing in metadata-based detection systems can therefore increase the rate at which CSAM is discovered and help thousands of victims. We propose a framework for training and evaluating deployment-ready machine learning models for CSAM identification. The framework provides guidelines for evaluating CSAM detection models against intelligent adversaries and for measuring model performance on open data. We apply the proposed framework to the problem of CSAM detection based on file paths. In our experiments, the best-performing model is a convolutional neural network (CNN) that achieves an accuracy of 0.97. By evaluating the CNN against adversarially modified data, we show it is robust to offenders actively trying to evade detection. Experiments with open datasets confirm that the model generalizes well and is deployment-ready.
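To make the file-path approach concrete, the sketch below illustrates the core operation of a character-level CNN: one-hot encoding a path, sliding a convolution filter over character positions, and max-pooling the responses into a single feature. This is a minimal, self-contained illustration, not the paper's actual model; the vocabulary, filter width, and the hand-set "jpg" filter are assumptions chosen only to show the mechanism (a trained model would learn many such filters).

```python
# Minimal sketch of character-level convolution over a file path,
# the building block of a char-CNN classifier. Hypothetical example;
# all parameters (vocab, max_len, filter width) are illustrative.
import string

VOCAB = {c: i for i, c in enumerate(string.printable)}

def one_hot(path, max_len=64):
    """Encode a file path as a max_len x |VOCAB| one-hot matrix."""
    mat = []
    for ch in path[:max_len].lower():
        row = [0.0] * len(VOCAB)
        if ch in VOCAB:
            row[VOCAB[ch]] = 1.0
        mat.append(row)
    while len(mat) < max_len:          # zero-pad short paths
        mat.append([0.0] * len(VOCAB))
    return mat

def conv1d_maxpool(mat, kernel, width=3):
    """Slide one width-3 filter over character positions, then max-pool."""
    scores = []
    for t in range(len(mat) - width + 1):
        s = sum(kernel[k][j] * mat[t + k][j]
                for k in range(width) for j in range(len(VOCAB)))
        scores.append(s)
    return max(scores)

# A hand-crafted filter that fires on the character trigram "jpg";
# a real CNN would learn hundreds of such filters from labeled data.
jpg_filter = [[0.0] * len(VOCAB) for _ in range(3)]
for k, c in enumerate("jpg"):
    jpg_filter[k][VOCAB[c]] = 1.0

score_img = conv1d_maxpool(one_hot("photos/img001.jpg"), jpg_filter)
score_doc = conv1d_maxpool(one_hot("docs/report.txt"), jpg_filter)
```

Here `score_img` reaches the filter's maximum response (all three characters match), while `score_doc` stays lower; stacking many learned filters and feeding the pooled features to dense layers yields the kind of classifier the abstract describes.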