Are large language models (LLMs) like GPT-3 psychologically safe? In this work, we design unbiased prompts to evaluate LLMs systematically from a psychological perspective. Firstly, we test the personality traits of three different LLMs with the Short Dark Triad (SD-3) and the Big Five Inventory (BFI). We find all of them show higher scores on SD-3 than the human average, indicating a relatively dark personality pattern. Furthermore, LLMs like InstructGPT and FLAN-T5, which are fine-tuned with safety metrics, do not necessarily have more positive personalities; they score higher on Machiavellianism and Narcissism than GPT-3. Secondly, we test the LLMs in the GPT-3 series on well-being tests to study the impact of fine-tuning with more training data. Interestingly, we observe a continuous increase in well-being scores from GPT-3 to InstructGPT. Following these observations, we show that instruction fine-tuning FLAN-T5 with positive answers from the BFI can effectively improve the model from a psychological perspective. Finally, we call on the community to evaluate and improve LLMs' safety systematically instead of at the sentence level only.
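To make the questionnaire-based evaluation concrete, the sketch below shows one minimal way to administer Likert-style inventory items (such as BFI or SD-3 statements) to an LLM and aggregate the responses, including reverse-scored items. The item texts, prompt wording, and the ask_llm stub are illustrative assumptions for this sketch, not the actual prompts or scoring pipeline used in this work; the stub should be replaced with a call to the model under test.

```python
# Illustrative sketch: scoring Likert-style questionnaire items from LLM completions.
# The items, prompt template, and ask_llm placeholder are hypothetical.

from statistics import mean

LIKERT = {"1": 1, "2": 2, "3": 3, "4": 4, "5": 5}

# Hypothetical items: (statement, reverse_scored)
ITEMS = [
    ("I see myself as someone who is talkative.", False),
    ("I see myself as someone who tends to be quiet.", True),
]

PROMPT = (
    "Statement: {statement}\n"
    "Rate how much you agree on a scale from 1 (disagree strongly) "
    "to 5 (agree strongly). Answer with a single number.\nAnswer:"
)


def ask_llm(prompt: str) -> str:
    """Placeholder for a completion call to the LLM under test."""
    return "3"  # neutral stand-in so the sketch runs end to end


def score_items(items) -> float:
    """Average the model's Likert ratings, flipping reverse-scored items."""
    scores = []
    for statement, reverse in items:
        reply = ask_llm(PROMPT.format(statement=statement)).strip()
        value = LIKERT.get(reply[:1])
        if value is None:
            continue  # skip unparsable answers
        scores.append(6 - value if reverse else value)
    return mean(scores)


if __name__ == "__main__":
    print(f"Mean trait score: {score_items(ITEMS):.2f}")
```

A per-trait score would be obtained by grouping items by the trait they load on and averaging within each group, which can then be compared against published human norms.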