Conflicting explanations, arising from different attribution methods or model internals, limit the adoption of machine learning models in safety-critical domains. We turn this disagreement into an advantage and introduce EXplanation AGREEment (EXAGREE), a two-stage framework that selects a Stakeholder-Aligned Explanation Model (SAEM) from a set of similarly performing models. The selection maximizes Stakeholder-Machine Agreement (SMA), a single metric that unifies faithfulness and plausibility. EXAGREE couples a differentiable mask-based attribution network (DMAN) with monotone differentiable sorting, enabling gradient-based search inside the constrained model space. Experiments on six real-world datasets demonstrate simultaneous gains in faithfulness, plausibility, and fairness over baselines while preserving task accuracy. Extensive ablation studies, significance tests, and case studies confirm the robustness and practical feasibility of the method.