Understanding predictions made by deep neural networks is notoriously difficult, but also crucial to their dissemination. Like all machine learning based methods, they are only as good as their training data, and can also capture unwanted biases. While there are tools that can help understand whether such biases exist, they do not distinguish between correlation and causation, and might be ill-suited for text-based models and for reasoning about high-level language concepts. A key problem in estimating the causal effect of a concept of interest on a given model is that this estimation requires generating counterfactual examples, which is challenging with existing generation technology. To bridge that gap, we propose CausaLM, a framework for producing causal model explanations using counterfactual language representation models. Our approach is based on fine-tuning deep contextualized embedding models with auxiliary adversarial tasks derived from the causal graph of the problem. Concretely, we show that by carefully choosing auxiliary adversarial pre-training tasks, language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest and be used to estimate its true causal effect on model performance. A byproduct of our method is a language representation model that is unaffected by the tested concept, which can be useful in mitigating unwanted bias ingrained in the data.
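To make the adversarial idea concrete, the following is a minimal, hypothetical sketch rather than the authors' implementation: a BERT encoder is trained with a task head plus an adversarial concept head placed behind a gradient-reversal layer, so the learned representation becomes uninformative about the treated concept; the concept's effect is then approximated as the average difference between the predictions of the original model and of the model built on this counterfactual representation. The class names, the data-loader format, and the use of gradient reversal during fine-tuning (the paper describes auxiliary adversarial pre-training tasks derived from the causal graph) are all simplifying assumptions made here for illustration.

```python
# Hypothetical sketch of adversarial counterfactual representation learning
# (simplified; not the paper's code).
import torch
import torch.nn as nn
from transformers import BertModel


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class CounterfactualBert(nn.Module):
    """BERT encoder with a task head and an adversarial concept head.
    The reversed gradient pushes the encoder to discard the treated concept.
    Training loss (assumed): CE(task_logits, y_task) + CE(adv_logits, y_concept)."""
    def __init__(self, num_task_labels=2, num_concept_labels=2, lambd=1.0):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.task_head = nn.Linear(hidden, num_task_labels)
        self.concept_head = nn.Linear(hidden, num_concept_labels)
        self.lambd = lambd

    def forward(self, input_ids, attention_mask):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).pooler_output
        task_logits = self.task_head(cls)
        adv_logits = self.concept_head(GradReverse.apply(cls, self.lambd))
        return task_logits, adv_logits


def estimate_effect(orig_model, cf_model, loader):
    """Average absolute difference between the original and counterfactual
    models' class probabilities; loader yields (input_ids, attention_mask)."""
    diffs = []
    with torch.no_grad():
        for input_ids, attention_mask in loader:
            p_orig = torch.softmax(orig_model(input_ids, attention_mask)[0], dim=-1)
            p_cf = torch.softmax(cf_model(input_ids, attention_mask)[0], dim=-1)
            diffs.append((p_orig - p_cf).abs().sum(dim=-1))
    return torch.cat(diffs).mean().item()
```

The sketch folds a single adversarial objective into task fine-tuning for brevity; per the abstract, the actual framework introduces the adversarial objectives as auxiliary pre-training tasks chosen from the causal graph of the problem.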