To ensure trust in AI models, it is becoming increasingly apparent that model evaluation must extend beyond traditional performance metrics, such as accuracy, to other dimensions, including fairness, explainability, adversarial robustness, and robustness to distribution shift. We describe an empirical study that evaluates multiple model types on various metrics along these dimensions across several datasets. Our results show that no single model type performs well on all dimensions, and they demonstrate the kinds of trade-offs involved in selecting models evaluated along multiple dimensions.