Detecting hateful content is a challenging and important problem. Automated tools, such as machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate the capability of eight open-source LLMs to detect antisemitic content, specifically leveraging an in-context definition as a policy guideline. We also study how LLMs understand and explain their decisions when given a moderation policy as a guideline. First, we explore various prompting techniques and design a new CoT-like prompt, Guided-CoT, finding that injecting domain-specific thoughts increases performance and utility. Guided-CoT handles the in-context policy effectively, improving performance and utility by reducing refusals across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences across LLMs in utility, explainability, and reliability. Code and resources are available at: https://github.com/idramalab/quantify-llm-explanations
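To give a rough intuition for quantifying semantic divergence between rationales, below is a minimal illustrative sketch that measures divergence as the average pairwise cosine distance between sentence embeddings of rationales produced for the same post. The embedding model, helper function, and rationale strings are illustrative assumptions, not the metrics defined in the paper.

```python
# Illustrative only: one plausible way to quantify semantic divergence between
# model-generated rationales via sentence-embedding cosine distance.
# The embedding model and example rationales are placeholders, not the
# paper's implementation.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def mean_pairwise_divergence(rationales: list[str]) -> float:
    """Average (1 - cosine similarity) over all pairs of rationales
    produced for the same post by different LLMs or configurations."""
    embeddings = encoder.encode(rationales, convert_to_tensor=True)
    sims = util.cos_sim(embeddings, embeddings)
    pairs = list(combinations(range(len(rationales)), 2))
    return float(sum(1.0 - sims[i][j] for i, j in pairs) / len(pairs))


# Hypothetical rationales from two models labeling the same post:
print(mean_pairwise_divergence([
    "The post invokes an antisemitic conspiracy trope about global control.",
    "The text is sarcastic criticism of a public figure, not targeting Jews.",
]))
```

A higher score indicates that models (or decoding configurations) justify the same decision in semantically different ways, which is one way to surface the divergent and paradoxical rationale behaviors mentioned above.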