MedLatinEpi and MedLatinLit: Two Datasets for the Computational Authorship Analysis of Medieval Latin Texts

We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a text of unknown authorship was written by a candidate author or not. We also make available the source code of the authorship verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other researchers. We also describe the application of the above authorship verification system, using these datasets as training data, for investigating the authorship of two medieval epistles whose authorship has been disputed by scholars.
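To make the authorship verification task concrete, the following is a minimal sketch of a verifier that returns a yes/no decision for one candidate author. It is not the system released with the datasets, whose internals the abstract does not describe. It assumes a simple profile-based baseline: character 3-gram counts compared by cosine similarity against a fixed threshold. The function names (`char_ngrams`, `cosine`, `verify`), the n-gram size, and the threshold value are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def verify(known_texts, unknown_text, threshold=0.5, n=3):
    """Decide whether unknown_text is attributed to the candidate author.

    known_texts: texts known to be by the candidate (e.g. the training texts
    labelled with that author). The threshold is an arbitrary placeholder and
    would need tuning on held-out data.
    Returns (decision, score).
    """
    profile = Counter()
    for text in known_texts:
        profile.update(char_ngrams(text, n))
    score = cosine(profile, char_ngrams(unknown_text, n))
    return score >= threshold, score
```

In use, `known_texts` would hold the texts labelled with the candidate author and `unknown_text` the text under dispute. A supervised classifier trained on positive and negative examples is a common alternative to a fixed similarity threshold.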
