Detecting hateful content is a challenging and important problem. Automated tools, such as machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate the capability of eight open-source LLMs to detect antisemitic content, specifically leveraging an in-context definition as a policy guideline. We also study how LLMs understand and explain their decisions when given a moderation policy as a guideline. First, we explore various prompting techniques and design a new CoT-like prompt, Guided-CoT, finding that injecting domain-specific thoughts increases performance and utility. Guided-CoT handles the in-context policy effectively, improving performance and utility by reducing refusals across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences across LLMs in utility, explainability, and reliability. Code and resources are available at: https://github.com/idramalab/quantify-llm-explanations
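To give a rough intuition for quantifying semantic divergence between rationales, below is a minimal illustrative sketch that measures divergence as the average pairwise cosine distance between sentence embeddings of rationales produced for the same post. The embedding model, helper function, and rationale strings are illustrative assumptions, not the metrics defined in the paper.

```python
# Illustrative only: one plausible way to quantify semantic divergence between
# model-generated rationales via sentence-embedding cosine distance.
# The embedding model and example rationales are placeholders, not the
# paper's implementation.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def mean_pairwise_divergence(rationales: list[str]) -> float:
    """Average (1 - cosine similarity) over all pairs of rationales
    produced for the same post by different LLMs or configurations."""
    embeddings = encoder.encode(rationales, convert_to_tensor=True)
    sims = util.cos_sim(embeddings, embeddings)
    pairs = list(combinations(range(len(rationales)), 2))
    return float(sum(1.0 - sims[i][j] for i, j in pairs) / len(pairs))


# Hypothetical rationales from two models labeling the same post:
print(mean_pairwise_divergence([
    "The post invokes an antisemitic conspiracy trope about global control.",
    "The text is sarcastic criticism of a public figure, not targeting Jews.",
]))
```

A higher score indicates that models (or decoding configurations) justify the same decision in semantically different ways, which is one way to surface the divergent and paradoxical rationale behaviors mentioned above.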