This paper analyzes the publication of datasets on families of RNA viruses collected via Google Dataset Search, using terminology obtained from the National Cancer Institute (NCI) Thesaurus developed by the US Department of Health and Human Services. The objective is to determine the scope and reuse capacity of the available data: the number of datasets and whether they are freely accessible, the proportion in reusable download formats, the main providers, the publication chronology, and the scientific provenance of the datasets. We also examine possible relationships between the publication of datasets and the main pandemics that have occurred during the last 10 years. The results show that only 52% of the datasets are related to scientific research, while an even smaller fraction (15%) are reusable. There is also an upward trend in the publication of datasets, especially in connection with the main epidemics, as clearly confirmed for the Ebola virus, Zika, SARS-CoV, H1N1, H1N5, and especially the SARS-CoV-2 coronavirus. Finally, the search engine has not yet implemented adequate methods for filtering and monitoring the datasets. These results reveal some of the difficulties facing open science in the dataset field.