Despite decades of research on authorship attribution (AA) and authorship verification (AV), inconsistent dataset splits/filtering and mismatched evaluation methods make it difficult to assess the state of the art. In this paper, we present a survey of the fields, resolve points of confusion, introduce Valla, which standardizes and benchmarks AA/AV datasets and metrics, and provide a large-scale empirical evaluation with apples-to-apples comparisons between existing methods. We evaluate eight promising methods on fifteen datasets (including distribution-shifted challenge sets) and introduce a new large-scale dataset based on texts archived by Project Gutenberg. Surprisingly, we find that a traditional Ngram-based model performs best on 5 (of 7) AA tasks, achieving an average macro-accuracy of $76.50\%$ (compared to $66.71\%$ for a BERT-based model). However, on the two AA datasets with the greatest number of words per author, as well as on the AV datasets, BERT-based models perform best. While AV methods are easily applied to AA, they are seldom included as baselines in AA papers. We show that, through the application of hard-negative mining, AV methods are competitive alternatives to AA methods. Valla and all experiment code can be found here: https://github.com/JacobTyo/Valla
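To make the AV-to-AA connection concrete, the following is a minimal, hypothetical sketch (not Valla's actual implementation) of how a pairwise verification scorer can be reused for attribution by ranking candidate authors, and how hard negatives could be mined for training. The functions `embed`, `av_score`, `attribute`, and `mine_hard_negatives` are illustrative placeholders, not names from the paper or repository.

```python
# Hypothetical sketch: authorship attribution (AA) via an authorship
# verification (AV) scorer, plus simple hard-negative mining.
import numpy as np


def embed(text: str) -> np.ndarray:
    """Placeholder text encoder; any AV embedding model could stand in here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)


def av_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity as a stand-in verification score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def attribute(query: str, author_texts: dict[str, list[str]]) -> str:
    """AA via AV: score the query against each candidate author's texts
    and return the author with the highest mean verification score."""
    q = embed(query)
    scores = {
        author: float(np.mean([av_score(q, embed(t)) for t in texts]))
        for author, texts in author_texts.items()
    }
    return max(scores, key=scores.get)


def mine_hard_negatives(anchor: str, negatives: list[str], k: int = 5) -> list[str]:
    """Hard-negative mining: pick the k different-author texts that the
    current model scores as most similar to the anchor, to serve as
    negative pairs in the next round of AV training."""
    a = embed(anchor)
    ranked = sorted(negatives, key=lambda t: av_score(a, embed(t)), reverse=True)
    return ranked[:k]
```

In this setup, the verification model never needs to be retrained per author set: attribution reduces to nearest-author ranking over verification scores, while hard-negative mining concentrates training on the confusable different-author pairs.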