Over the past few years, different types of data-driven Artificial Intelligence (AI) techniques have been widely adopted across scientific domains for generating predictive black-box models. However, because of their black-box nature, it is crucial to establish trust in these models before accepting their predictions as accurate. One way of achieving this goal is through a post-hoc interpretation scheme that can put forward the reasons behind a black-box model's predictions. In this work, we propose a classical thermodynamics-inspired approach for this purpose: Thermodynamically Explainable Representations of AI and other black-box Paradigms (TERP). TERP works by constructing a linear, local surrogate model that approximates the behaviour of the black-box model within a small neighborhood around the instance being explained. By employing a simple forward feature selection Monte Carlo algorithm, TERP assigns an interpretability free energy score to all possible surrogate models in order to choose an optimal interpretation. Additionally, we validate TERP as a generally applicable method by successfully interpreting four different classes of black-box models trained on datasets from relevant domains, including image classification, heart disease prediction, and the classification of biomolecular conformations.
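To make the procedure described above concrete, the following is a minimal Python sketch of a local linear surrogate built by greedy forward feature selection under a free-energy-like score. It is not the TERP implementation: the perturbation scheme, the Gaussian proximity kernel, the batch-callable `black_box` interface, and the exact functional form of the interpretability free energy (here a weighted residual "energy" plus a temperature-scaled complexity term) are all assumptions for illustration; in this sketch the Monte Carlo aspect appears only in the random neighborhood sampling.

```python
# A minimal sketch (not the authors' implementation) of the local-surrogate
# idea: perturb the instance being explained, fit sparse linear models on the
# black-box outputs, and greedily select the feature subset that minimizes a
# free-energy-like score. The score's form, U + T*S with S = model size, is
# an assumed stand-in for TERP's interpretability free energy.
import numpy as np
from sklearn.linear_model import LinearRegression

def local_surrogate_search(black_box, x0, n_samples=500, sigma=0.1,
                           temp=0.01, seed=None):
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    # Monte Carlo sampling of a local neighborhood around the instance x0.
    X = x0 + sigma * rng.standard_normal((n_samples, d))
    y = black_box(X)  # assumed: black_box maps (n, d) array -> (n,) predictions
    # Weight samples by proximity to x0 (Gaussian kernel, a common choice).
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2.0 * sigma ** 2))

    selected, best_score = [], np.inf
    remaining = list(range(d))
    # Forward feature selection: at each step, tentatively add each remaining
    # feature, score the resulting linear surrogate, and keep the best one.
    while remaining:
        trial_scores = {}
        for j in remaining:
            feats = selected + [j]
            model = LinearRegression().fit(X[:, feats], y, sample_weight=w)
            resid = y - model.predict(X[:, feats])
            energy = np.average(resid ** 2, weights=w)   # unfaithfulness "U"
            trial_scores[j] = energy + temp * len(feats)  # + T * complexity "S"
        j_best = min(trial_scores, key=trial_scores.get)
        if trial_scores[j_best] >= best_score:
            break  # no candidate lowers the free-energy score; stop
        best_score = trial_scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_score
```

Under these assumptions, one would refit a final weighted linear model on the selected features and read off its coefficients as the local interpretation; the temperature-like parameter `temp` controls how strongly additional features are penalized, trading faithfulness against sparsity.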