Counterfactual examples have been shown to be useful for many applications, including calibrating, evaluating, and explaining model decision boundaries. However, previous methods for generating such counterfactual examples have been tightly tailored to a specific application, used a limited range of linguistic patterns, or have been hard to scale. We propose to disentangle counterfactual generation from its use cases, i.e., gather general-purpose counterfactuals first, and then select them for specific applications. We frame automated counterfactual generation as a text generation task, and finetune GPT-2 into a generator, Polyjuice, which produces fluent and diverse counterfactuals. Our method also allows control over where perturbations happen and what they do. We show that Polyjuice supports multiple use cases: by generating diverse counterfactuals for humans to label, Polyjuice helps produce high-quality datasets for model training and evaluation, requiring 40% less human effort. When used to generate explanations, Polyjuice helps augment feature attribution methods to reveal models' erroneous behaviors.
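As a concrete illustration of the "control over where perturbations happen and what they do" mentioned above, one common interface for such conditioned generation is a prompt that combines the original sentence, a control code (what the perturbation does), and a blanked template (where it happens). The following is a minimal sketch of that prompt shape only; the specific special tokens (`<|perturb|>`, `[BLANK]`, `[SEP]`) and the `negation` control code are assumptions drawn from the publicly released Polyjuice model card, not a definitive specification, and the actual model call is omitted.

```python
# Hedged sketch of a Polyjuice-style controlled-perturbation prompt.
# Token names and the control-code vocabulary are assumptions based on
# the released model's documented input format, not guaranteed.

PERTURB_TOKEN = "<|perturb|>"
BLANK_TOKEN = "[BLANK]"
SEP_TOKEN = "[SEP]"

def build_prompt(original: str, ctrl_code: str, blanked: str) -> str:
    """Compose: original sentence, control code (what the perturbation
    does), and a blanked template (where the perturbation happens)."""
    return f"{original} {PERTURB_TOKEN} [{ctrl_code}] {blanked} {SEP_TOKEN}"

prompt = build_prompt(
    "It is great for kids.",                  # sentence to perturb
    "negation",                               # what: flip polarity
    f"It is {BLANK_TOKEN} great for kids.",   # where: position of the blank
)
print(prompt)
```

In practice, a finetuned generator would consume a prompt like this and fill the `[BLANK]` (e.g. with "not"), yielding a counterfactual that differs from the original only at the controlled position.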