Long-term data-driven studies have become indispensable in many areas of science. Often the formats, structures, and semantics of data change over time: the data sets evolve. Studies spanning several decades, in particular, therefore have to account for changing database schemas. The evolution of these databases eventually leads to a large number of schemas, which are costly and time-consuming to store and manage. For research data to be reproducible, however, each database version must be reconstructable with little effort, so that a previously published result can be validated and reproduced at any time. Nevertheless, in many cases such an evolution cannot be fully reconstructed. This article classifies the 15 most frequently used schema modification operators and defines the associated inverse for each operation. To avoid information loss, it also defines which additional provenance information has to be stored. We define four classes dealing with dangling tuples, duplicates and provenance-invariant operators. Each class is presented by means of one representative. By using and extending the theory of schema mappings and their inverses for queries, data analysis, why-provenance, and schema evolution, we are able to combine data analysis applications with provenance under evolving database structures, in order to enable the reproducibility of scientific results over longer periods of time. Although most of the inverses of schema mappings used for analysis or evolution are not exact inverses but only quasi-inverses, adding provenance information enables us to reconstruct a sub-database of research data that is sufficient to guarantee reproducibility.
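The difference between a quasi-inverse and an inverse supported by provenance can be illustrated with a minimal Python sketch. It is not taken from the article: the function names `drop_column`, `quasi_inverse`, and `inverse_with_provenance`, the sample rows, and the choice of a column drop under set semantics (where duplicates collapse) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def drop_column(rows: List[Row], column: str) -> Tuple[List[Row], List[Tuple[int, object]]]:
    """Hypothetical DROP COLUMN operator under set semantics.

    Returns the evolved relation (duplicates eliminated) and a provenance
    side table holding, for every original row, the index of the evolved
    row it was mapped to and the value that was dropped.
    """
    evolved: List[Row] = []
    index_of: Dict[tuple, int] = {}
    provenance: List[Tuple[int, object]] = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != column}
        key = tuple(sorted(rest.items()))  # hashable identity of the remaining row
        if key not in index_of:
            index_of[key] = len(evolved)
            evolved.append(rest)
        provenance.append((index_of[key], row[column]))
    return evolved, provenance


def quasi_inverse(evolved: List[Row], column: str) -> List[Row]:
    """Without provenance, the dropped column can only be padded with nulls."""
    return [{**row, column: None} for row in evolved]


def inverse_with_provenance(
    evolved: List[Row], provenance: List[Tuple[int, object]], column: str
) -> List[Row]:
    """With the side table, every original row is restored, duplicates included."""
    return [{**evolved[i], column: value} for i, value in provenance]


# Assumed sample data: two measurements that differ only in the dropped column.
original = [
    {"site": "A", "year": 2001, "temp": 4.2},
    {"site": "A", "year": 2001, "temp": 4.9},
    {"site": "B", "year": 2001, "temp": 3.1},
]

evolved, provenance = drop_column(original, "temp")
lossy = quasi_inverse(evolved, "temp")
restored = inverse_with_provenance(evolved, provenance, "temp")
```

In this sketch `lossy` contains only two rows with `temp` set to `None`, whereas `restored` is equal to `original`. The side table plays the role of the additional provenance information that has to be stored; which information is actually required for each operator class is defined in the article itself.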