Ribonucleic acid (RNA) virus and coronavirus in Google Dataset Search: their scope and epidemiological correlation

This paper analyzes the publication of datasets retrieved via Google Dataset Search for families of RNA viruses, using terminology taken from the National Cancer Institute (NCI) thesaurus developed by the US Department of Health and Human Services. The objective is to determine the scope and reuse capacity of the available data: the number of datasets and whether they are freely accessible, the proportion available in reusable download formats, the main providers, the publication chronology, and their scientific provenance. We also examine possible relationships between the publication of datasets and the main pandemics of the last 10 years. The results show that only 52% of the datasets are related to scientific research, while an even smaller fraction (15%) are reusable. There is also an upward trend in the publication of datasets, especially in connection with the main epidemics, as clearly confirmed for the Ebola virus, Zika, SARS-CoV, H1N1, H1N5, and especially the SARS-CoV-2 coronavirus. Finally, the search engine has not yet implemented adequate methods for filtering and monitoring the datasets. These results reveal some of the difficulties facing open science in the dataset field.

