Adversarial attacks challenge the reliability of Explainable AI (XAI) by altering explanations while the model's output remains unchanged. The success of these attacks on text-based XAI is often judged using standard information retrieval metrics. We argue that these measures are poorly suited to evaluating trustworthiness, as they treat all word perturbations equally while ignoring synonymity, which can misrepresent an attack's true impact. To address this, we apply synonymity weighting, a method that amends these measures by incorporating the semantic similarity of the perturbed words. This produces more accurate vulnerability assessments and provides an important tool for assessing the robustness of AI systems. Our approach prevents the overestimation of attack success, leading to a more faithful understanding of an XAI system's true resilience against adversarial manipulation.
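To make the idea concrete, the following is a minimal sketch (not the paper's actual implementation) of how a standard perturbation-counting measure can be amended with synonymity weighting. The function names, the toy similarity table, and the specific metric (fraction of top-k explanation words changed) are illustrative assumptions; in practice the similarity would come from a semantic model such as word embeddings.

```python
# Illustrative sketch: synonymity-weighted attack-success scoring.
# All names and the toy similarity table below are hypothetical.

def attack_success_plain(original_topk, perturbed_topk):
    """Standard measure: fraction of top-k explanation words that changed,
    treating every perturbation equally."""
    changed = [w for w in original_topk if w not in perturbed_topk]
    return len(changed) / len(original_topk)

def attack_success_weighted(original_topk, perturbed_topk,
                            substitutions, similarity):
    """Synonymity-weighted variant: a word replaced by a near-synonym
    contributes (1 - similarity) instead of a full count, so semantically
    benign swaps barely inflate the attack-success score."""
    score = 0.0
    for w in original_topk:
        if w in perturbed_topk:
            continue  # word survived the attack, no contribution
        sub = substitutions.get(w)
        sim = similarity(w, sub) if sub is not None else 0.0
        score += 1.0 - sim  # sim close to 1 => near-synonym => small penalty
    return score / len(original_topk)

# Toy semantic-similarity lookup (stand-in for an embedding model).
_SIMS = {("good", "great"): 0.9, ("movie", "film"): 0.95}

def toy_similarity(a, b):
    return _SIMS.get((a, b), _SIMS.get((b, a), 0.0))

if __name__ == "__main__":
    original = ["good", "movie", "plot"]
    perturbed = ["great", "film", "plot"]
    subs = {"good": "great", "movie": "film"}
    print(attack_success_plain(original, perturbed))      # counts both swaps fully
    print(attack_success_weighted(original, perturbed,
                                  subs, toy_similarity))  # discounts near-synonyms
```

In this example the plain measure reports two of three words changed, while the weighted measure discounts both swaps because the substitutes are near-synonyms, illustrating how unweighted metrics can overstate an attack's impact.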