We investigate how authorship identification is affected by a fundamental shift in the way documents are represented as vectors for input to a supervised learner. In ``classic'' authorship analysis, a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the setting in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether or not the two documents are by the same author. This latter (learner-independent) type of representation has occasionally been used before, but has never been studied systematically. We argue that it is advantageous, and that in some cases (e.g., authorship verification) it provides a much larger quantity of information to the training process than the standard representation. Experiments on several publicly available datasets (one of which we make available here for the first time) show that feature vectors representing pairs of documents (which we call Diff-Vectors, or DVs) bring about systematic improvements in the effectiveness of authorship identification tasks, especially when training data are scarce (as is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared towards solving the first, we also provide two novel methods for solving the second and third that use a solver for the first as a building block.
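The pairwise representation described above can be illustrated with a minimal sketch. The toy vocabulary, the whitespace tokenization, the author names, and the helper names (`relative_frequencies`, `diff_vector`, `build_pairs`) are all assumptions made for illustration; they are not the feature set or the code used in the experiments. The sketch only shows how labelled documents might be turned into labelled Diff-Vectors, and why n documents yield n(n-1)/2 training pairs.

```python
from collections import Counter
from itertools import combinations


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary feature in one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[f] / total if total else 0.0 for f in vocabulary]


def diff_vector(x1, x2):
    """Absolute per-feature difference; symmetric, so the pair is unordered."""
    return [abs(a - b) for a, b in zip(x1, x2)]


def build_pairs(vectors, authors):
    """Turn n labelled documents into n*(n-1)/2 labelled Diff-Vectors."""
    X, y = [], []
    for i, j in combinations(range(len(vectors)), 2):
        X.append(diff_vector(vectors[i], vectors[j]))
        # Label 1 means "same author", 0 means "different authors".
        y.append(1 if authors[i] == authors[j] else 0)
    return X, y


# Hypothetical toy data: (author, tokenized document).
vocabulary = ["the", "of", "and", "but"]
docs = [
    ("alice", "the cat and the dog".split()),
    ("alice", "the end of the road".split()),
    ("bob", "but of course".split()),
    ("bob", "but not of this".split()),
]

authors = [author for author, _ in docs]
vectors = [relative_frequencies(tokens, vocabulary) for _, tokens in docs]
pair_X, pair_y = build_pairs(vectors, authors)

print(len(vectors), "documents ->", len(pair_X), "Diff-Vectors")
print(pair_y)
```

Any binary classifier could then be trained on `pair_X` and `pair_y` to address same-author verification; the choice of learner is left open here, in keeping with the learner-independent nature of the representation.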