Machine learning has been proven to be susceptible to carefully crafted samples, known as adversarial examples. Generating these adversarial examples helps make models more robust and gives us insight into their underlying decision making. Over the years, researchers have successfully attacked image classifiers in both white-box and black-box settings. However, these methods are not directly applicable to text, as text data is discrete in nature. In recent years, research on crafting adversarial examples against textual applications has been on the rise. In this paper, we present a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers, where no model information is disclosed and an attacker can only query the model to obtain the final decision of the classifier, without confidence scores for the classes involved. Such an attack scenario applies to real-world black-box models used for security-sensitive applications such as sentiment analysis and toxic content detection.
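To make the hard-label threat model concrete, the following is a minimal sketch of the query interface it assumes: the attacker submits text and receives only the predicted label, never the class probabilities. The names `hard_label_query` and `toy_classifier` are illustrative assumptions, not part of the paper's implementation.

```python
# Minimal sketch of the hard-label black-box setting (illustrative only).
from typing import Callable, List


def hard_label_query(classifier: Callable[[str], List[float]], text: str) -> int:
    """Query a black-box classifier and expose only its final decision.

    The attacker never observes the underlying confidence scores; only the
    argmax label is returned, which is what makes the setting 'hard label'.
    """
    scores = classifier(text)  # hidden from the attacker
    return max(range(len(scores)), key=lambda i: scores[i])  # label only


def toy_classifier(text: str) -> List[float]:
    """Toy two-class (negative/positive) model standing in for a real
    black-box API such as a hosted sentiment or toxicity classifier."""
    positive_words = {"good", "great", "excellent"}
    hit = min(sum(word in positive_words for word in text.lower().split()), 1)
    return [1.0 - hit, float(hit)]


if __name__ == "__main__":
    label = hard_label_query(toy_classifier, "The movie was great")
    print(label)  # 1 (positive); no confidence scores are ever exposed
```

An attack in this setting must perturb the input text and decide success or failure purely from such label-only feedback.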