I Wish I Would Have Loved This One, But I Didn't -- A Multilingual Dataset for Counterfactual Detection in Product Reviews

Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese. The dataset is unique in that it contains counterfactuals in multiple languages, covers a new application area (e-commerce reviews), and provides high-quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced by cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged with them to learn accurate CFD models. Applying machine translation to English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.
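
To make the phrase "cue phrase-based sentence selection" concrete, the following is a minimal sketch of how candidate sentences might be pre-filtered before annotation. The cue list, the function name `select_candidates`, and the example reviews are all hypothetical illustrations; the abstract does not give the dataset's actual cue phrases or selection procedure.

```python
import re

# Hypothetical English cue phrases, chosen only for illustration.
# The dataset's actual cue list is not stated in the abstract.
CUE_PHRASES = ["wish", "would have", "could have", "should have", "if only"]

# One case-insensitive pattern that matches any cue phrase on word boundaries.
_CUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in CUE_PHRASES) + r")\b",
    re.IGNORECASE,
)


def select_candidates(sentences):
    """Return the sentences that contain at least one cue phrase."""
    return [s for s in sentences if _CUE_RE.search(s)]


# Made-up review sentences, not taken from the dataset.
reviews = [
    "I wish I would have loved this one, but I didn't.",
    "The battery lasts two days.",
    "If only it came in blue.",
]
candidates = select_candidates(reviews)
```

Because every selected sentence contains a cue phrase, a classifier trained on such data could in principle learn the cues rather than counterfactuality itself. This is the selectional bias the abstract refers to.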

