Although publishing datasets in scientific repositories is now widely accepted practice, these repositories increasingly suffer from semantic problems. In particular, they present obstacles to data reuse by machine-actionable processes, especially in biological repositories, which hampers the reproducibility of scientific experiments. The GenBank database is one example of these shortcomings. To address these issues, we propose GAP, an innovative data model that enriches the semantic meaning of the data. The model converges related approaches: data provenance, semantic interoperability, the FAIR principles, and nanopublications. As a proof of concept, our experiments include a prototype that scrapes genomic data and traces them to nanopublications. The (meta)data are stored in a three-level nanopublication (nanopub) data model. The first level concerns the target organism and specifies data in terms of its biological taxonomy. The second level, the central part of our contribution, covers the biological strains of the target; each strain carries deciphered (meta)data on the genetic variations of the genomic material. The third level stores the (meta)data of related scientific papers. By incorporating these associated approaches to store genomic data, we expect the proposed model to offer greater storage flexibility and broader interoperability with other data sources.
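The three-level layout described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the GAP schema: the class names, field names, the `make_nanopub` helper, and the placeholder values are all hypothetical, and attaching papers to a strain (rather than to the organism) is an assumption made only for this example. The one part taken from the nanopublication approach itself is that each nanopub pairs an assertion with its provenance and publication info.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Nanopub:
    """One nanopublication: an assertion plus provenance and publication info."""
    assertion: Dict[str, str]
    provenance: Dict[str, str]
    pubinfo: Dict[str, str]


@dataclass
class StrainLevel:
    """Level 2: a strain of the target and its genetic-variation (meta)data."""
    strain: Nanopub
    variations: List[Nanopub] = field(default_factory=list)
    # Level 3: related papers. Nesting them under a strain is an assumption.
    papers: List[Nanopub] = field(default_factory=list)


@dataclass
class OrganismLevel:
    """Level 1: the target organism, described by its biological taxonomy."""
    taxonomy: Nanopub
    strains: List[StrainLevel] = field(default_factory=list)


def make_nanopub(assertion: Dict[str, str], source: str) -> Nanopub:
    """Hypothetical helper: wrap an assertion with placeholder provenance."""
    return Nanopub(
        assertion=assertion,
        provenance={"derived_from": source},
        pubinfo={"creator": "example-prototype"},
    )


def count_nanopubs(organism: OrganismLevel) -> int:
    """Count every nanopub reachable from the organism level."""
    total = 1  # the taxonomy nanopub itself
    for s in organism.strains:
        total += 1 + len(s.variations) + len(s.papers)
    return total


def build_example() -> OrganismLevel:
    """Build a record from placeholder values only (no real genomic data)."""
    src = "example-repository-record"
    strain = StrainLevel(
        strain=make_nanopub({"strain": "example-strain"}, src),
        variations=[
            make_nanopub({"variation": "example-variation-1"}, src),
            make_nanopub({"variation": "example-variation-2"}, src),
        ],
        papers=[make_nanopub({"title": "example-paper"}, src)],
    )
    return OrganismLevel(
        taxonomy=make_nanopub({"taxon": "example-organism"}, src),
        strains=[strain],
    )
```

Calling `count_nanopubs(build_example())` walks all three levels. Keeping each level as a list of self-contained nanopubs means a strain, variation, or paper can be added without altering the records already stored.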