Despite recent concerns about undesirable behaviors generated by large language models (LLMs), including non-factual, biased, and hateful language, we find that LLMs are inherently multi-task language checkers, grounded in their latent representations of natural and social knowledge. We present an interpretable, unified language checking (UniLC) method for both human- and machine-generated language that checks whether a language input is factual and fair. While fairness and fact-checking tasks have previously been handled separately by dedicated models, we find that LLMs can achieve high performance on a combination of fact-checking, stereotype detection, and hate speech detection tasks with a simple, few-shot, unified set of prompts. With the ``1/2-shot'' multi-task language checking method proposed in this work, the GPT-3.5-turbo model outperforms fully supervised baselines on several language tasks. These simple methods and results suggest that, by drawing on strong latent knowledge representations, an LLM can serve as an adaptive and explainable tool for detecting misinformation, stereotypes, and hate speech.
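As a rough illustration of the kind of unified, few-shot prompting the abstract describes (not the authors' actual prompts), the sketch below sends a single check request to GPT-3.5-turbo through the OpenAI chat API; the prompt wording, few-shot examples, and label format are hypothetical.

```python
# Illustrative sketch only: one unified prompt asks the LLM to check a claim for
# factuality and fairness (misinformation, stereotypes, hate speech) and explain
# its verdict. The examples and wording below are hypothetical, not from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical few-shot examples covering a factual error and a neutral statement.
FEW_SHOT = """\
Claim: The Great Wall of China is visible from the Moon with the naked eye.
Verdict: NOT OK (non-factual). The wall is far too narrow to be seen from the Moon.

Claim: Our new colleague from accounting starts on Monday.
Verdict: OK. The statement is neutral and makes no unfair generalization.
"""

def check_language(claim: str) -> str:
    """Return the model's verdict and one-sentence explanation for a single claim."""
    prompt = (
        "You are a language checker. Decide whether the claim below is factual "
        "and fair (no misinformation, stereotypes, or hate speech). "
        "Answer with 'OK' or 'NOT OK' and a one-sentence explanation.\n\n"
        f"{FEW_SHOT}\nClaim: {claim}\nVerdict:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    print(check_language("Bats are blind."))
```

The appeal of this style of unified prompting, as the abstract argues, is that one prompt format covers fact-checking, stereotype detection, and hate speech detection while also yielding a natural-language explanation alongside the label.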