Large language models (LLMs) can pass explicit bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both of these challenges by introducing two measures of bias inspired by psychology: LLM Implicit Association Test (IAT) Bias, which is a prompt-based method for revealing implicit bias; and LLM Decision Bias for detecting subtle discrimination in decision-making tasks. Using these measures, we found pervasive human-like stereotype biases in 6 LLMs across 4 social domains (race, gender, religion, health) and 21 categories (weapons, guilt, science, career among others). Our prompt-based measure of implicit bias correlates with embedding-based methods but better predicts downstream behaviors measured by LLM Decision Bias. This measure is based on asking the LLM to decide between individuals, motivated by psychological results indicating that relative not absolute evaluations are more related to implicit biases. Using prompt-based measures informed by psychology allows us to effectively expose nuanced biases and subtle discrimination in proprietary LLMs that do not show explicit bias on standard benchmarks.
翻译:暂无翻译