In this work, we introduce a corpus for satire detection in Romanian news. We gathered 55,608 public news articles from multiple real and satirical news sources, making it one of the largest satire detection corpora in any language and the only one for Romanian. We provide an official split of the text samples in which training and test articles come from different sources, so that models cannot achieve high performance simply by overfitting. We conduct experiments with two state-of-the-art deep neural models, establishing a set of strong baselines for the corpus. Our results show that machine-level accuracy for satire detection in Romanian (under 73% on the test set) is well below human-level accuracy (87%), leaving considerable room for improvement in future research.
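The source-disjoint split described above can be illustrated with a minimal sketch. It is not the corpus's official splitting procedure: the `(source, text, label)` tuple layout, the function names, and the `site_*` source names are all hypothetical, chosen only to show how reserving whole sources for the test partition keeps the two partitions from sharing a publisher.

```python
def split_by_source(articles, test_sources):
    """Split articles so that no news source appears in both partitions.

    articles: iterable of (source, text, label) tuples (hypothetical layout).
    test_sources: set of source names reserved for the test partition.
    """
    train, test = [], []
    for source, text, label in articles:
        target = test if source in test_sources else train
        target.append((source, text, label))
    return train, test


def sources_of(partition):
    """Return the set of source names present in a partition."""
    return {source for source, _, _ in partition}


# Toy data: label 0 = regular news, label 1 = satire (illustrative only).
articles = [
    ("site_a", "regular news text", 0),
    ("site_b", "satirical news text", 1),
    ("site_c", "regular news text", 0),
    ("site_d", "satirical news text", 1),
]

train, test = split_by_source(articles, test_sources={"site_c", "site_d"})

# The two partitions must not share any source.
assert sources_of(train).isdisjoint(sources_of(test))
```

Splitting at the source level rather than the article level means a model cannot score well on the test set merely by recognizing the style of a publisher it saw during training.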