With an ever-growing number of new publications each day, scientific writing is an interesting domain for authorship analysis of both single-author and multi-author documents. Unfortunately, most existing corpora lack either material from the science domain or the required metadata. Hence, we present SMAuC, a new metadata-rich corpus designed specifically for authorship analysis in scientific writing. With more than three million publications from various scientific disciplines, SMAuC is the largest openly available corpus for authorship analysis to date. It combines a wide and diverse range of scientific texts from the humanities and natural sciences with rich and curated metadata, including unique and carefully disambiguated author IDs. We hope SMAuC will contribute significantly to advancing the field of authorship analysis in the science domain.