To date, there has been no formal study of the statistical cost of interpretability in machine learning. As such, the discourse around potential trade-offs is often informal, and misconceptions abound. In this work, we aim to initiate a formal study of these trade-offs. A seemingly insurmountable roadblock is the lack of any agreed-upon definition of interpretability. Instead, we propose a shift in perspective: rather than attempt to define interpretability, we propose to model the \emph{act} of \emph{enforcing} interpretability. As a starting point, we focus on the setting of empirical risk minimization for binary classification, and view interpretability as a constraint placed on learning. That is, we assume we are given a subset of hypotheses that are deemed to be interpretable, possibly depending on the data distribution and other aspects of the context. We then model the act of enforcing interpretability as that of performing empirical risk minimization over the set of interpretable hypotheses. This model allows us to reason about the statistical implications of enforcing interpretability, using known results in statistical learning theory. Focusing on accuracy, we perform a case analysis, explaining why one may or may not observe a trade-off between accuracy and interpretability, depending on whether the restriction to interpretable classifiers comes at the cost of excess statistical risk. We close with worked examples and open problems, which we hope will spur further theoretical development around the trade-offs involved in interpretability.
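As a minimal formal sketch of this framing (the notation below is ours, introduced for illustration rather than fixed by the text): let $\mathcal{H}$ be a hypothesis class, $\mathcal{I} \subseteq \mathcal{H}$ the subset deemed interpretable, and $S = \{(x_i, y_i)\}_{i=1}^{n}$ an i.i.d.\ sample from a distribution $\mathcal{D}$. Enforcing interpretability then corresponds to constrained empirical risk minimization:
\[
\hat{h}_{\mathcal{I}} \in \operatorname*{arg\,min}_{h \in \mathcal{I}} \hat{R}_S(h),
\qquad
\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(x_i) \neq y_i\}.
\]
The statistical cost of interpretability can then be read off the excess risk of $\hat{h}_{\mathcal{I}}$ relative to the best hypothesis in $\mathcal{H}$, which decomposes as
\[
R(\hat{h}_{\mathcal{I}}) - \inf_{h \in \mathcal{H}} R(h)
= \underbrace{\Big( R(\hat{h}_{\mathcal{I}}) - \inf_{h \in \mathcal{I}} R(h) \Big)}_{\text{estimation error over } \mathcal{I}}
+ \underbrace{\Big( \inf_{h \in \mathcal{I}} R(h) - \inf_{h \in \mathcal{H}} R(h) \Big)}_{\text{approximation penalty of } \mathcal{I}},
\]
where $R(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq y]$. A trade-off arises when the approximation penalty is strictly positive; when it vanishes, restricting to the smaller class $\mathcal{I}$ may even help, by reducing estimation error.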