Text classification methods have been widely investigated as a way to detect content of low credibility: fake news, social media bots, propaganda, etc. Quite accurate models (typically based on deep neural networks) help in moderating public electronic platforms and often cause content creators to face rejection of their submissions or removal of already published texts. Having an incentive to evade further detection, content creators try to come up with a slightly modified version of the text (known as an attack with an adversarial example) that exploits the weaknesses of classifiers and results in a different output. Here we introduce BODEGA: a benchmark for testing both victim models and attack methods on four misinformation detection tasks in an evaluation framework designed to simulate real use cases of content moderation. We also systematically test the robustness of popular text classifiers against available attacking techniques and discover that, indeed, in some cases barely noticeable changes in the input text can mislead the models. We openly share the BODEGA code and data in the hope of enhancing the comparability and replicability of further research in this area.
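For intuition, the sketch below illustrates the adversarial-example setting described above with a deliberately simplified, hypothetical keyword classifier standing in for a real victim model; the keyword list, scoring rule, and character substitutions are assumptions made purely for this example and do not correspond to any victim model or attack method implemented in BODEGA.

```python
# A minimal, purely illustrative sketch (not BODEGA's actual models or attacks)
# of the adversarial-example idea: a toy keyword-based "credibility" classifier
# and a small character-level edit that flips its decision while leaving the
# text readable to a human.

SUSPICIOUS_KEYWORDS = {"miracle", "cure", "shocking", "exposed"}


def predict(text: str) -> str:
    """Label text as 'low-credibility' if it contains at least two trigger keywords."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    hits = sum(token in SUSPICIOUS_KEYWORDS for token in tokens)
    return "low-credibility" if hits >= 2 else "credible"


def perturb(text: str) -> str:
    """Adversarial edit: swap single characters so the keywords no longer match."""
    return (text.replace("Shocking", "Sh0cking")
                .replace("miracle", "m1racle")
                .replace("cure", "cur3"))


original = "Shocking miracle cure exposed by doctors!"
adversarial = perturb(original)

print(predict(original))     # low-credibility
print(predict(adversarial))  # credible: barely noticeable edits evade detection
```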