In this paper, we introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions. Our prime motivation is to provide a reliable dataset that can be used together with the existing English dataset as a benchmark for testing the ability of pre-trained multilingual models to transfer knowledge from Czech to English and vice versa. Two annotators annotated the dataset, reaching an inter-annotator agreement of 0.83 in Cohen's κ. To the best of our knowledge, this is the first subjectivity dataset for the Czech language. We also created an additional dataset that consists of 200k automatically labeled sentences. Both datasets are freely available for research purposes. Furthermore, we fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset, achieving an accuracy of 93.56%. We also fine-tune the models on the existing English dataset, obtaining results that are on par with the current state of the art. Finally, we perform zero-shot cross-lingual subjectivity classification between Czech and English to verify the usability of our dataset as a cross-lingual benchmark. We compare and discuss the cross-lingual and monolingual results and the ability of multilingual models to transfer knowledge between languages.
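For readers unfamiliar with the agreement measure mentioned above, the following is a minimal sketch of how Cohen's κ can be computed for two annotators: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa` and the toy label lists are hypothetical and serve only as an illustration; they are not taken from the dataset, and this is not necessarily the procedure the authors used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label counts.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined here, and this sketch returns 1.0 by choice.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels, made up for illustration only (not from the dataset).
annotator_1 = ["subj", "subj", "obj", "obj", "subj",
               "obj", "subj", "obj", "obj", "subj"]
annotator_2 = ["subj", "subj", "obj", "obj", "subj",
               "obj", "obj", "obj", "obj", "subj"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy case the annotators agree on 9 of 10 sentences (p_o = 0.9) while chance agreement is p_e = 0.5, so κ comes out lower than the raw agreement rate. This is why κ is usually reported instead of plain percentage agreement.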