Adversarial attacks on machine learning-based classifiers, along with defense mechanisms, have been widely studied in the context of single-label classification problems. In this paper, we shift the attention to multi-label classification, where the availability of domain knowledge on the relationships among the considered classes may offer a natural way to spot incoherent predictions, i.e., predictions associated with adversarial examples lying outside the training data distribution. We explore this intuition in a framework in which first-order logic knowledge is converted into constraints and injected into a semi-supervised learning problem. Within this setting, the constrained classifier learns to fulfill the domain knowledge over the marginal distribution, and can naturally reject samples with incoherent predictions. Even though our method does not exploit any knowledge of attacks during training, our experimental analysis surprisingly reveals that domain-knowledge constraints can help detect adversarial examples effectively, especially if such constraints are not known to the attacker.
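To make the notion of "incoherent predictions" concrete, the following minimal Python sketch (not the paper's implementation) shows one way a first-order rule such as cat(x) → animal(x) can be turned into a real-valued constraint via a t-norm-based implication, with the constraint violation used to reject a sample. The t-norm choice, the example rules, and the rejection threshold are all assumptions made for illustration.

```python
# Minimal sketch: rejecting samples whose multi-label predictions violate
# domain-knowledge rules. Rule set, t-norm, and threshold are assumptions.
import numpy as np

def implication_degree(p_antecedent, p_consequent):
    """Truth degree of (antecedent -> consequent) under the Goguen (product t-norm)
    implication: 1 if antecedent <= consequent, otherwise consequent / antecedent."""
    return np.where(p_antecedent <= p_consequent, 1.0,
                    p_consequent / np.clip(p_antecedent, 1e-12, None))

def constraint_violation(probs, rules):
    """probs: dict mapping class name -> predicted probability for one sample.
    rules: list of (antecedent, consequent) class-name pairs, e.g. ('cat', 'animal').
    Returns the average violation (1 - truth degree) over all rules."""
    degrees = [implication_degree(probs[a], probs[c]) for a, c in rules]
    return 1.0 - float(np.mean(degrees))

# Hypothetical domain knowledge and predictions.
rules = [("cat", "animal"), ("dog", "animal")]
coherent = {"cat": 0.9, "dog": 0.1, "animal": 0.95}
incoherent = {"cat": 0.9, "dog": 0.1, "animal": 0.05}  # "cat" without "animal"

tau = 0.5  # rejection threshold (assumed)
for name, p in [("coherent", coherent), ("incoherent", incoherent)]:
    v = constraint_violation(p, rules)
    print(f"{name}: violation={v:.2f} -> {'reject' if v > tau else 'accept'}")
```

In the same spirit, a differentiable version of this violation can be added as a penalty over unlabeled data during semi-supervised training, so that the constrained classifier both fulfills the knowledge on the marginal distribution and exposes a natural rejection score at test time.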