As neural networks become the tool of choice for an increasing variety of problems in our society, adversarial attacks become a critical concern. The possibility of generating data instances deliberately designed to fool a network's analysis can have disastrous consequences. Recent work has shown that commonly used methods for model training often result in fragile abstract representations that are particularly vulnerable to such attacks. This paper presents a visual framework for investigating neural network models subjected to adversarial examples, revealing how a model's perception of adversarial data differs from its perception of regular data instances, and how both relate to its perception of classes. Through different use cases, we show how observing these elements can quickly pinpoint exploited areas in a model, allowing further study of vulnerable features in the input data and serving as a guide to improving model training and architecture.