We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts for use in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are epistolary in nature, while MedLatinLit texts consist of literary comments and treatises on various subjects. As such, the two datasets lend themselves to supporting research on authorship analysis tasks such as authorship attribution, authorship verification, and same-author verification. Along with the datasets we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether or not a text of unknown authorship was written by a candidate author. We also make available the source code of the authorship verification system we have used, thus allowing other researchers to reproduce our experiments and to use them as baselines. Finally, we describe an application of this authorship verification system, trained on these datasets, to investigating the authorship of two medieval epistles whose authorship has been disputed by scholars.