We study the problem of generating counterfactual text for a classifier as a means for understanding and debugging classification. Given a textual input and a classification model, we aim to minimally alter the text to change the model's prediction. White-box approaches have been successfully applied to similar problems in vision where one can directly optimize the continuous input. Optimization-based approaches become difficult in the language domain due to the discrete nature of text. We bypass this issue by directly optimizing in the latent space and leveraging a language model to generate candidate modifications from optimized latent representations. We additionally use Shapley values to estimate the combinatoric effect of multiple changes. We then use these estimates to guide a beam search for the final counterfactual text. We achieve favorable performance compared to recent white-box and black-box baselines using human and automatic evaluations. Ablation studies show that both latent optimization and the use of Shapley values improve success rate and the quality of the generated counterfactuals.
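To make the Shapley-guided search concrete, the sketch below shows one plausible reading of the pipeline's final stage: Monte Carlo estimates of each candidate edit's Shapley value (its average marginal effect on the classifier's output over random orderings), followed by a small beam search over edit sets. Everything here is illustrative and hypothetical, not the paper's implementation: `classifier_pos_prob` is a toy stand-in for a real sentiment classifier, and the candidate `edits` stand in for modifications proposed by the language model from optimized latent representations.

```python
import random

# Toy stand-in for a sentiment classifier (hypothetical): returns the
# probability of the positive class given a token list. More negative
# words -> lower probability.
def classifier_pos_prob(tokens):
    negative = {"terrible", "boring", "bad"}
    n_neg = sum(t in negative for t in tokens)
    return 1.0 / (1.0 + 2.0 ** n_neg)

def shapley_estimates(tokens, edits, n_samples=200, seed=0):
    """Monte Carlo Shapley value of each candidate edit's effect on the
    positive-class probability. `edits` maps position -> replacement token."""
    rng = random.Random(seed)
    positions = list(edits)
    phi = {p: 0.0 for p in positions}
    for _ in range(n_samples):
        order = positions[:]
        rng.shuffle(order)                 # random permutation of edits
        applied = list(tokens)
        prev = classifier_pos_prob(applied)
        for p in order:
            applied[p] = edits[p]          # apply the next edit in the order
            cur = classifier_pos_prob(applied)
            phi[p] += (cur - prev) / n_samples  # marginal contribution
            prev = cur
    return phi

def beam_search(tokens, edits, beam_width=2, target=0.5):
    """Add edits in order of estimated Shapley value, keeping the
    `beam_width` best edit sets, until the prediction crosses `target`."""
    phi = shapley_estimates(tokens, edits)
    ranked = sorted(edits, key=lambda p: phi[p], reverse=True)

    def score(edit_set):
        applied = list(tokens)
        for q in edit_set:
            applied[q] = edits[q]
        return classifier_pos_prob(applied)

    beam = [frozenset()]
    for p in ranked:
        candidates = beam + [s | {p} for s in beam]
        beam = sorted(set(candidates), key=score, reverse=True)[:beam_width]
        if score(beam[0]) >= target:       # prediction flipped; stop early
            break
    result = list(tokens)
    for q in beam[0]:
        result[q] = edits[q]
    return result

tokens = "the movie was terrible and boring".split()
edits = {3: "great", 5: "fun"}  # hypothetical language-model proposals
print(beam_search(tokens, edits))
# → ['the', 'movie', 'was', 'great', 'and', 'fun']
```

With the toy classifier, each edit removes one negative word, so both edits receive similar Shapley estimates and the search applies both to push the positive-class probability from 0.2 up to 0.5, flipping the prediction with a minimal edit set.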