Automated content filtering and moderation is an important tool that allows online platforms to build thriving user communities that facilitate cooperation and prevent abuse. Unfortunately, resourceful actors try to bypass automated filters in a bid to post content that violates platform policies and codes of conduct. To reach this goal, these malicious actors may obfuscate policy-violating images (e.g., overlaying harmful images with carefully selected benign images or visual patterns) to prevent machine learning models from reaching the correct decision. In this paper, we invite researchers to tackle this specific issue and present a new image benchmark. This benchmark, based on ImageNet, simulates the type of obfuscations created by malicious actors. It goes beyond ImageNet-$\textrm{C}$ and ImageNet-$\bar{\textrm{C}}$ by proposing general, drastic, adversarial modifications that preserve the original content intent. It aims to tackle a more common adversarial threat than the one considered by $\ell_p$-norm bounded adversaries. We evaluate 33 pretrained models on the benchmark and train models with different augmentations, architectures, and training methods on subsets of the obfuscations to measure generalization. We hope this benchmark will encourage researchers to test their models and methods and try to find new approaches that are more robust to these obfuscations.
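The overlay-style obfuscation mentioned above can be illustrated with a short sketch. The benchmark's actual transformations are not specified in this abstract, so the snippet below is only a minimal, hypothetical example of blending a benign pattern over an image; the file names, the `alpha` value, and the helper name are illustrative assumptions, not part of the benchmark.

```python
import numpy as np
from PIL import Image

def overlay_obfuscation(image_path, pattern_path, alpha=0.35):
    """Blend a benign pattern over an image to simulate an overlay-style obfuscation.

    The original content stays recognizable to humans, but the pixel statistics
    a classifier relies on are perturbed. alpha controls the overlay strength
    (0.0 = original image, 1.0 = pattern only).
    """
    image = Image.open(image_path).convert("RGB")
    pattern = Image.open(pattern_path).convert("RGB").resize(image.size)

    img = np.asarray(image, dtype=np.float32)
    pat = np.asarray(pattern, dtype=np.float32)

    # Convex combination of the two images, clipped back to valid pixel range.
    blended = (1.0 - alpha) * img + alpha * pat
    return Image.fromarray(np.clip(blended, 0, 255).astype(np.uint8))

# Hypothetical usage: create an obfuscated copy and compare a pretrained
# classifier's prediction on the clean versus obfuscated input.
# obfuscated = overlay_obfuscation("example.jpg", "benign_pattern.jpg", alpha=0.4)
# obfuscated.save("example_obfuscated.jpg")
```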