Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese. The dataset is unique in that it contains counterfactuals in multiple languages, covers the new application area of e-commerce reviews, and provides high-quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced by cue-phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged with them to learn accurate CFD models. Creating multilingual data by applying machine translation to English counterfactual examples performs poorly, demonstrating the language specificity of this problem, which has been ignored so far.
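To make the idea of cue-phrase-based sentence selection concrete, the following is a minimal sketch of how candidate sentences might be filtered before annotation. The cue phrases, the function name `select_candidates`, and the example reviews are all hypothetical and chosen only for illustration; they are not the list or procedure used to build the dataset.

```python
import re

# Hypothetical English cue phrases, for illustration only.
CUE_PHRASES = ["wish", "if only", "would have", "could have", "should have"]

# One case-insensitive pattern that matches any cue phrase on word boundaries.
_CUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in CUE_PHRASES) + r")\b",
    re.IGNORECASE,
)


def select_candidates(sentences):
    """Return the sentences that contain at least one cue phrase."""
    return [s for s in sentences if _CUE_RE.search(s)]


# Made-up review sentences, not taken from the dataset.
reviews = [
    "I wish this case had a pocket.",
    "The case fits my phone well.",
    "It would have been perfect with a longer cable.",
]
candidates = select_candidates(reviews)
```

A filter of this kind only proposes candidates: a sentence can contain a cue phrase without being counterfactual, so the selected sentences still need to be labeled. Because every selected sentence contains a cue phrase, a classifier could come to rely on the cues alone, which is the kind of selection bias the abstract refers to.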