Data is of high quality if it is fit for its intended use. Data quality is influenced by the underlying data model and by that model's quality. One major quality problem is data heterogeneity, which impairs quality aspects such as understandability and interoperability. This heterogeneity may be caused by quality problems in the data model. In particular, it can arise when information is insufficiently structured and captured only in data values, often because the underlying data model lacks suitable structure. We propose a bottom-up approach to detecting quality problems in data models that manifest as heterogeneous data values. The approach supports an exploratory analysis of existing data and can be configured by domain experts according to their domain knowledge. All values of a selected data field are clustered by syntactic similarity, which provides an overview of the syntactic diversity of the data values. This overview is intended to help domain experts understand how the data model is used in practice and identify potential quality problems in the data model. We outline a proof-of-concept implementation and evaluate our approach using cultural heritage data.
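To make the idea of clustering field values by syntactic similarity concrete, the following is a minimal sketch and not the implementation described here. It assumes a very simple notion of similarity: each value is abstracted into a coarse character-class pattern (letters, digits, whitespace, punctuation), and values with identical patterns form one cluster. The function names `syntactic_pattern` and `cluster_by_pattern`, the abstraction scheme, and the sample date-like values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import groupby


def syntactic_pattern(value: str) -> str:
    """Abstract a value into a coarse syntax pattern.

    Assumed scheme (not taken from the paper): letters become "A",
    digits become "9", whitespace becomes " ", and any other
    character is kept as-is.
    """
    classes = []
    for ch in value:
        if ch.isdigit():
            classes.append("9")
        elif ch.isalpha():
            classes.append("A")
        elif ch.isspace():
            classes.append(" ")
        else:
            classes.append(ch)
    # Collapse runs of the same symbol so that "1850" and "85"
    # end up with the same pattern "9".
    return "".join(key for key, _ in groupby(classes))


def cluster_by_pattern(values):
    """Group values that share the same syntax pattern."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value)].append(value)
    return dict(clusters)


# Hypothetical values of a single "date" field.
values = ["1850", "ca. 1850", "1850-1860", "around 1900", "1901", "ca. 1720"]

# Print the clusters, largest first, as a quick overview of syntactic diversity.
clusters = cluster_by_pattern(values)
for pattern, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(f"{pattern!r}: {members}")
```

In this sketch, a domain expert would see at a glance that one field mixes plain years, ranges, and qualified dates, which hints that the data model may lack dedicated structure for uncertainty or date ranges. A configurable approach would let experts adjust the abstraction rules to their domain.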