Biases in LLMs can harm user experience and societal outcomes. Current bias mitigation methods such as RLHF usually rely on costly human feedback, transfer poorly to other topics, and perform poorly. We find that telling an LLM that content it generated was produced by someone else, and then querying it about potential biases, greatly boosts its awareness of biases and its ability to mitigate them. Based on this, we propose RLDF (Reinforcement Learning from Multi-role Debates as Feedback), which replaces human feedback with AI feedback for bias mitigation. RLDF engages LLMs in multi-role debates to expose biases and reduces them gradually in each iteration using a ranking-based scoring mechanism. The dialogues are then used to create a dataset composed of both high-bias and low-bias instances, which trains the reward model used in reinforcement learning. This dataset can be generated by the same LLM, for self-reflection, or by a superior LLM, such as one accessed through an API, which guides the former in a teacher-student mode. Experimental results across different LLMs and types of bias show the effectiveness of our approach in bias mitigation.
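As an illustration of how ranked debate outputs might become reward-model training data, the following is a minimal sketch and not the paper's implementation. The names `DebateResponse`, `bias_score`, `build_preference_pairs`, and `min_gap` are hypothetical, the score scale (lower means less biased) is an assumption, and the Bradley-Terry style pairwise loss is a common choice for reward models that the abstract does not confirm.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class DebateResponse:
    text: str
    bias_score: float  # hypothetical scale: lower means less biased


def build_preference_pairs(responses, min_gap=0.0):
    """Turn scored debate responses into (low_bias, high_bias) text pairs."""
    ranked = sorted(responses, key=lambda r: r.bias_score)
    pairs = []
    for better, worse in combinations(ranked, 2):
        # Skip ties and near-ties, which carry no clear preference signal.
        if worse.bias_score - better.bias_score > min_gap:
            pairs.append((better.text, worse.text))
    return pairs


def pairwise_ranking_loss(reward_low_bias, reward_high_bias):
    """Assumed Bradley-Terry style loss: -log(sigmoid(r_low - r_high))."""
    margin = reward_low_bias - reward_high_bias
    return math.log1p(math.exp(-margin))


# Toy debate round with made-up scores, for illustration only.
responses = [
    DebateResponse("A", 0.9),
    DebateResponse("B", 0.2),
    DebateResponse("C", 0.5),
]
pairs = build_preference_pairs(responses)
```

Each pair places the lower-bias text first, so a reward model trained with the loss above is pushed to score that text higher than its higher-bias counterpart.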