Recent work has demonstrated that natural language processing techniques can support consumer protection by automatically detecting unfair clauses in Terms of Service (ToS) agreements. This work shows that transformer-based ToS analysis systems are vulnerable to adversarial attacks. We attack an unfair-clause detector with universal adversarial triggers, and experiments show that a minor perturbation of the text can considerably reduce detection performance. Moreover, to measure how detectable the triggers are, we conduct a detailed human evaluation study that collects both answer accuracy and response time from participants. The results show that the naturalness of the triggers remains key to tricking readers.
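To make the idea of a universal adversarial trigger concrete, here is a minimal sketch. A universal trigger is one fixed token sequence that is prepended to every input and chosen to lower the detector's hit rate across the whole dataset. Everything in the sketch is a hypothetical illustration and not the authors' setup:

- The detector is a toy bag-of-words linear scorer, not a transformer. The names `WEIGHTS`, `CLAUSES`, `VOCAB` and `search_trigger` are invented for this example.
- Published universal-trigger attacks typically use a gradient-guided token search on the real model. This sketch swaps in exhaustive enumeration over a tiny vocabulary, which is only feasible at toy scale.

```python
from itertools import product

# Hypothetical toy detector standing in for a transformer classifier:
# a clause is flagged "unfair" when its bag-of-words score exceeds 0.
WEIGHTS = {
    "terminate": 1.5, "sole": 1.0, "discretion": 1.0, "liability": 1.2,
    "without": 0.5, "notice": 0.5, "may": 0.3,
    "fair": -1.0, "reasonable": -1.2, "consent": -0.8, "refund": -0.9,
}
BIAS = -1.0

# Hypothetical unfair clauses, all flagged by the toy detector without a trigger.
CLAUSES = [
    "We may terminate your account at our sole discretion",
    "The provider excludes all liability without notice",
    "Terms may change without notice",
]
# Hypothetical candidate tokens the attacker may place in the trigger.
VOCAB = ["fair", "reasonable", "consent", "refund"]


def score(tokens):
    """Linear score of a token list; unknown tokens contribute nothing."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens) + BIAS


def is_unfair(text, trigger=()):
    """Prepend the trigger tokens to the clause and apply the detector."""
    tokens = list(trigger) + text.lower().split()
    return score(tokens) > 0


def detection_rate(clauses, trigger=()):
    """Fraction of clauses still flagged as unfair under the given trigger."""
    return sum(is_unfair(c, trigger) for c in clauses) / len(clauses)


def search_trigger(clauses, vocab, length=3):
    """Exhaustively pick the single trigger (same for every clause, hence
    'universal') that minimises the detection rate over the dataset."""
    best, best_rate = (), detection_rate(clauses)
    for cand in product(vocab, repeat=length):
        rate = detection_rate(clauses, cand)
        if rate < best_rate:
            best, best_rate = cand, rate
    return best, best_rate


if __name__ == "__main__":
    print("baseline detection rate:", detection_rate(CLAUSES))
    trigger, rate = search_trigger(CLAUSES, VOCAB, length=3)
    print("trigger:", " ".join(trigger), "-> detection rate:", rate)
```

In this toy setting, the trigger found is a repeated token such as "fair fair fair". It drives detection to zero but is plainly unnatural, which is the kind of cue the human evaluation probes when it asks whether readers notice the perturbation.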
翻译:最近的工作表明,自然语言处理技术可以通过自动发现《服务条款协议》中的不公平条款来支持消费者保护,这项工作表明,基于变压器的托盘分析系统很容易受到对抗性攻击。我们用通用对抗性触发器对不公平玻璃探测器进行实验。实验表明,对文本的轻微扰动可以大大降低探测性能。此外,为了测量触发器的可探测性,我们通过收集参与者的答复准确性和反应时间,进行详细的人类评价研究。结果显示,触发器的自然性仍然是欺骗读者的关键。