Approaches to cross-language authorship attribution rely either on translation, which enables the use of single-language features, or on language-independent feature extraction methods. Until recently, the lack of datasets for this problem hindered the development of the latter, and single-language solutions were applied to machine-translated corpora. In this paper, we present a novel language-independent feature for authorship analysis based on dependency graphs and universal part-of-speech tags, called DT-grams (dependency tree grams), which are constructed by selecting specific sub-parts of the dependency graph of sentences. We evaluate DT-grams by performing cross-language authorship attribution on untranslated datasets of bilingual authors, showing that, on average across five different language pairs, they achieve a macro-averaged F1 score 0.081 higher than previous methods. Additionally, by reporting results for a diverse set of features for comparison, we provide a baseline for the previously undocumented task of untranslated cross-language authorship attribution.
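To make the idea of selecting sub-parts of a dependency graph concrete, the following is a minimal sketch, not the construction used in the paper. It assumes a sentence is already parsed into (index, head index, universal POS tag) triples, and it uses two illustrative sub-tree shapes chosen here for demonstration: chains of a token and its ancestors, and a head together with its children. The function names `ancestor_chains`, `sibling_groups`, and `profile`, as well as the hand-written example parse, are hypothetical.

```python
from collections import Counter

# Each token is (index, head_index, universal POS tag); head 0 marks the root.
# Hand-written parse of "The cat chased a mouse" (an assumption, not parser output).
SENTENCE = [
    (1, 2, "DET"),
    (2, 3, "NOUN"),
    (3, 0, "VERB"),
    (4, 5, "DET"),
    (5, 3, "NOUN"),
]


def ancestor_chains(sentence, n):
    """Return tag tuples for each token followed by its n-1 nearest ancestors."""
    head = {idx: h for idx, h, _ in sentence}
    tag = {idx: t for idx, _, t in sentence}
    grams = []
    for idx, _, _ in sentence:
        chain, node = [], idx
        # Walk towards the root, collecting at most n tags.
        while node != 0 and len(chain) < n:
            chain.append(tag[node])
            node = head[node]
        # Keep only chains that reached the full length n.
        if len(chain) == n:
            grams.append(tuple(chain))
    return grams


def sibling_groups(sentence):
    """Return (head tag, child tags in word order) for heads with two or more children."""
    tag = {idx: t for idx, _, t in sentence}
    children = {}
    for idx, h, _ in sorted(sentence):
        if h != 0:
            children.setdefault(h, []).append(tag[idx])
    return [
        (tag[h],) + tuple(kids)
        for h, kids in sorted(children.items())
        if len(kids) >= 2
    ]


def profile(sentences, n=2):
    """Count all illustrative sub-tree patterns over an author's sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ancestor_chains(sentence, n))
        counts.update(sibling_groups(sentence))
    return counts
```

Because the patterns consist only of universal POS tags and tree structure, the same counting procedure can be applied to parsed text in any language that shares the tag set, which is what allows a profile built from one language to be compared with documents in another. The resulting counts could then be normalized and passed to any standard classifier.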