Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by a failure to capture authorship style, by the topic shift, or by other factors. Motivated by this, we propose the \emph{topic confusion} task, where we switch the author-topic configuration between the training and testing sets. This setup allows us to probe errors in the attribution process. We investigate the accuracy and two error measures: one caused by the model being confused by the switch, indicating features that capture the topic rather than the style, and one caused by the features' failure to capture writing style at all, indicating weaker models. By evaluating different features, we show that stylometric features with part-of-speech tags are less susceptible to topic variations and can increase the accuracy of the attribution process. We further show that combining them with word-level $n$-grams can outperform the state-of-the-art technique in the cross-topic scenario. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task and are outperformed by simple $n$-gram features.
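To make the setup concrete, the sketch below builds a topic-confusion split for two authors and two topics. It is a minimal illustration, assuming documents come as (text, author, topic) triples; the function name and signature are hypothetical and not the paper's actual pipeline.

\begin{verbatim}
def topic_confusion_split(docs, author_a, author_b, topic_1, topic_2):
    """Minimal sketch of a topic-confusion split (hypothetical helper):
    the author-topic pairing seen during training is switched at test
    time. `docs` is an iterable of (text, author, topic) triples."""
    train, test = [], []
    for text, author, topic in docs:
        if (author, topic) in {(author_a, topic_1), (author_b, topic_2)}:
            # training pairing: author A writes on topic 1, B on topic 2
            train.append((text, author))
        elif (author, topic) in {(author_a, topic_2), (author_b, topic_1)}:
            # testing pairing is switched: A on topic 2, B on topic 1
            test.append((text, author))
    return train, test
\end{verbatim}

Under this split, a classifier whose features encode the topic rather than the writing style will systematically swap the two authors at test time, which is precisely the error the task is designed to isolate.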