Deep neural networks (DNNs) are powerful, but they can make mistakes that pose significant risks. A model performing well on a test set does not imply safety in deployment, so it is important to have additional tools to understand its flaws. Adversarial examples can help reveal weaknesses, but they are often difficult for a human to interpret or draw generalizable, actionable conclusions from. Some previous works have addressed this by studying human-interpretable attacks. We build on these with three contributions. First, we introduce Search for Natural Adversarial Features Using Embeddings (SNAFUE), a fully automated method for finding "copy/paste" attacks in which one natural image can be pasted into another to induce an unrelated misclassification. Second, we use SNAFUE to red-team an ImageNet classifier and identify hundreds of easily describable sets of vulnerabilities. Third, we compare this approach with other interpretability tools by attempting to rediscover trojans. Our results suggest that SNAFUE can be useful for interpreting DNNs and generating adversarial data for them. Code is available at https://github.com/thestephencasper/snafue.
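To make the "copy/paste" attack setting concrete, the sketch below shows how such an attack can be evaluated: a natural patch image is pasted into a source image, and the classifier's prediction on the modified image is checked. This is only an illustrative setup, not the SNAFUE search procedure itself; the classifier (a standard torchvision ResNet-50), the image file names, and the patch placement are all assumptions made for the example.

```python
# Minimal sketch of evaluating a "copy/paste" attack: paste a natural patch
# image into a source image and inspect the classifier's prediction on the
# result. Model choice, file names, and patch location are illustrative
# assumptions, not the paper's exact pipeline.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

def paste_patch(image, patch, top=32, left=32, size=64):
    """Resize a natural patch and paste it into a copy of the image."""
    image = image.copy()
    patch = patch.resize((size, size))
    image.paste(patch, (left, top))
    return image

# Hypothetical files: a source image and a candidate natural patch.
source = Image.open("source.jpg").convert("RGB").resize((224, 224))
patch = Image.open("patch.jpg").convert("RGB")

attacked = paste_patch(source, patch)
with torch.no_grad():
    logits = model(preprocess(attacked).unsqueeze(0))
pred = logits.argmax(dim=1).item()
print("Predicted class after pasting patch:", weights.meta["categories"][pred])
```

An attack of this kind succeeds when the pasted patch reliably shifts predictions toward a class unrelated to either image's true content.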