On the State of the Art in Authorship Attribution and Authorship Verification

Despite decades of research on authorship attribution (AA) and authorship verification (AV), inconsistent dataset splits/filtering and mismatched evaluation methods make it difficult to assess the state of the art. In this paper, we present a survey of the fields, resolve points of confusion, introduce Valla, which standardizes and benchmarks AA/AV datasets and metrics, conduct a large-scale empirical evaluation, and provide apples-to-apples comparisons between existing methods. We evaluate eight promising methods on fifteen datasets (including distribution-shifted challenge sets) and introduce a new large-scale dataset based on texts archived by Project Gutenberg. Surprisingly, we find that a traditional Ngram-based model performs best on 5 (of 7) AA tasks, achieving an average macro-accuracy of $76.50\%$ (compared to $66.71\%$ for a BERT-based model). However, on the two AA datasets with the greatest number of words per author, as well as on the AV datasets, BERT-based models perform best. While AV methods are easily applied to AA, they are seldom included as baselines in AA papers. We show that through the application of hard-negative mining, AV methods are competitive alternatives to AA methods. Valla and all experiment code are available at https://github.com/JacobTyo/Valla
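To make the reported metric concrete, here is a minimal sketch of macro-accuracy, assuming it means the unweighted mean of per-author accuracies (so that prolific authors do not dominate the score). The function name `macro_accuracy` and the toy labels are illustrative only and are not taken from the Valla codebase.

```python
from collections import defaultdict


def macro_accuracy(y_true, y_pred):
    """Average the per-author accuracy so every author counts equally.

    Assumption: "macro-accuracy" is read here as the unweighted mean of
    per-class (per-author) accuracy; this is not copied from Valla.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labeled example is required")

    correct = defaultdict(int)  # correctly attributed texts per true author
    total = defaultdict(int)    # all texts per true author
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1

    per_author = [correct[author] / total[author] for author in total]
    return sum(per_author) / len(per_author)


# Toy example: author "a" has four texts, author "b" has one.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]  # a model that always predicts "a"
score = macro_accuracy(y_true, y_pred)
```

In this toy case plain (micro) accuracy would be 4/5, while the macro-averaged value is (1.0 + 0.0) / 2, which shows why the choice of averaging matters when authors have very different numbers of texts.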

