A prominent issue in aligning language models (LMs) to personalized preferences is underspecification: users provide too little information about their preferences. A popular way to inject such a specification is to add a prefix (e.g., prior relevant conversations) to the current user's conversation to steer the preference distribution. Most methods passively model personal preferences from prior example preference pairs. We ask whether models benefit from actively inferring preference descriptions, and address this question by creating a synthetic personalized alignment dataset based on famous people with known public preferences. We then test how effectively finetuned models with 1-8B parameters infer and align to personal preferences. Results show that higher-quality active prefixes lead to better generalization, more contextually faithful models, and fewer systematic biases across different protected attributes. Together, our results suggest that active alignment can lead to a more controllable and efficient path for personalized alignment.