The task of identifying the author of a text has been studied for several decades, using methods from linguistics, statistics, and, more recently, machine learning. Motivated by the impressive performance gains of transformer models across a broad range of natural language processing tasks, and by the recent availability of the large-scale PAN authorship dataset, we first study the effectiveness of several BERT-like transformers for the task of authorship verification. These models consistently achieve very high scores. Next, we show empirically that they exploit existing biases in the dataset, focusing on topical clues rather than on characteristics of the author's writing style. To address this problem, we provide new splits for PAN-2020 in which training and test data are sampled from disjoint topics or authors. Finally, we introduce DarkReddit, a dataset with a different input data distribution. We use it to analyze the domain generalization performance of the models in a low-data regime, and to examine how performance varies when the proposed PAN-2020 splits are used for fine-tuning. We show that these splits enhance the models' ability to transfer knowledge to a new, significantly different dataset.
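As a concrete illustration, the sketch below shows one common way to frame authorship verification with a BERT-like encoder: binary classification over a pair of texts. This is not the paper's exact setup; the model name and the use of an untrained classification head are assumptions made for brevity, and real use would first fine-tune on PAN-2020 verification pairs.

```python
# Minimal sketch (assumptions noted above): authorship verification as
# binary classification over a text pair with a BERT-like encoder,
# using the Hugging Face `transformers` library.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-cased"  # assumed; the paper evaluates several BERT-like models

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def same_author_probability(text_a: str, text_b: str) -> float:
    """Return P(same author) for a pair of texts.

    The pair is encoded jointly ([CLS] text_a [SEP] text_b [SEP]) and
    truncated to the encoder's maximum input length.
    """
    inputs = tokenizer(text_a, text_b, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Index 1 is taken to be the "same author" class by convention here.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```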
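The author-disjoint resplit can be sketched as follows. This is an illustrative procedure, not the actual PAN-2020 resplitting code: the field names `author_a` and `author_b` are assumed, and pairs straddling the train/test boundary are simply dropped to keep the author sets disjoint. A topic-disjoint split would follow the same pattern with topic labels in place of author labels.

```python
# Minimal sketch of an author-disjoint split (assumed schema, see above).
import random

def author_disjoint_split(pairs, test_fraction=0.2, seed=0):
    """Split verification pairs so that no author appears in both train and test."""
    authors = sorted({a for p in pairs for a in (p["author_a"], p["author_b"])})
    rng = random.Random(seed)
    rng.shuffle(authors)
    test_authors = set(authors[: int(test_fraction * len(authors))])
    train, test = [], []
    for p in pairs:
        pair_authors = {p["author_a"], p["author_b"]}
        if pair_authors <= test_authors:
            test.append(p)
        elif pair_authors.isdisjoint(test_authors):
            train.append(p)
        # Pairs with one author on each side are dropped to preserve disjointness.
    return train, test
```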