Concept-based explanation methods aim at making machine learning models more transparent by finding the most important semantic features of an input (e.g., colors, patterns, shapes) for a given prediction task. However, these methods generally ignore the communicative context of explanations, such as the preferences of a listener. For example, medical doctors understand explanations in terms of clinical markers, but patients may not, needing a different vocabulary to rationalize the same diagnosis. We address this gap with listener-adaptive explanations grounded in principles of pragmatic reasoning and the rational speech act. We introduce an iterative training procedure based on direct preference optimization where a speaker learns to compose explanations that maximize communicative utility for a listener. Our approach only needs access to pairwise preferences, which can be collected from human feedback, making it particularly relevant in real-world scenarios where a model of the listener may not be available. We demonstrate that our method is able to align speakers with the preferences of simulated listeners on image classification across three datasets, and further validate that pragmatic explanations generated with our method improve the classification accuracy of participants in a user study.
翻译:暂无翻译