We present the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains more than 91 million cases of reused text passages found in 4.2 million unique open-access publications. Featuring high coverage of scientific disciplines and varieties of reuse, as well as comprehensive metadata to contextualize each case, our dataset addresses the most salient shortcomings of previous datasets on scientific writing. Webis-STEREO-21 allows for tackling a wide range of research questions from different scientific backgrounds, facilitating both qualitative and quantitative analysis of the phenomenon as well as a first grounding of the base rate of text reuse in scientific publications.