With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is these models' `aptitude' for hallucinating and generating harmful content. Despite major efforts to implement guardrails, human perceptions of these mitigation strategies remain largely unknown. We conducted a mixed-method experiment to evaluate the responses of a mitigation strategy across multiple dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed responses under two conditions: the harmful response paired with its mitigated version, and the mitigated response alone. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced their evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammatical errors while rewarding preserved semantic context. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduced new metrics for training and evaluating mitigation strategies and derived insights for human-AI evaluation studies.