Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values

From arXiv, 17 pages. This paper is an extended version of a paper to be published in "MoDELS '21: ACM/IEEE 24th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings". It was presented at the 3rd Workshop on Artificial Intelligence and Model-driven Engineering.

Data is of high quality if it is fit for its intended use. The quality of data is influenced by the underlying data model and its quality. One major quality problem is the heterogeneity of data as quality aspects such as understandability and interoperability are impaired. This heterogeneity may be caused by quality problems in the data model. Data heterogeneity can occur in particular when the information given is not structured enough and just captured in data values, often due to missing or non-suitable structure in the underlying data model. We propose a bottom-up approach to detecting quality problems in data models that manifest in heterogeneous data values. It supports an explorative analysis of the existing data and can be configured by domain experts according to their domain knowledge. All values of a selected data field are clustered by syntactic similarity. Thereby an overview of the data values' diversity in syntax is provided. It shall help domain experts to understand how the data model is used in practice and to derive potential quality problems of the data model. We outline a proof-of-concept implementation and evaluate our approach using cultural heritage data.
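The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how syntactic similarity is measured, so the sketch assumes a simple scheme in which each value is reduced to a coarse pattern (digit runs become `9`, lowercase runs `a`, uppercase runs `A`, punctuation is kept) and values with the same pattern form one cluster. The function names `syntactic_pattern` and `cluster_by_syntax` and the sample date values are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def syntactic_pattern(value: str) -> str:
    """Reduce a value to a coarse syntax pattern (assumed abstraction scheme)."""

    def char_class(ch: str) -> str:
        if ch.isdigit():
            return "9"
        if ch.isalpha():
            return "A" if ch.isupper() else "a"
        if ch.isspace():
            return " "
        return ch  # punctuation and symbols are kept as they are

    classes = [char_class(ch) for ch in value.strip()]
    # Collapse each run of identical classes into a single symbol,
    # so "1887" and "92" both map to "9".
    return "".join(key for key, _ in groupby(classes))


def cluster_by_syntax(values):
    """Group all values of one data field by their syntax pattern."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value)].append(value)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical values of a "date" field in a cultural heritage record set.
    dates = ["1887", "1901", "ca. 1750", "ca. 1600",
             "1850-1860", "17th century", "unknown"]
    clusters = cluster_by_syntax(dates)
    # Largest clusters first, to give an overview of the syntactic diversity.
    for pattern, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
        print(f"{pattern!r}: {members}")
```

In this sketch, a field whose values fall into many patterns (plain years, ranges, "ca." prefixes, free text) hints that the data model offers no dedicated structure for uncertainty or ranges. A domain expert would adjust the character classes to their domain, which is where the configurability mentioned in the abstract would come in.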

Related content

MoDELS

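The ACM/IEEE 23rd International Conference on Model Driven Engineering Languages and Systems (MODELS) is the premier conference series for model-driven software and systems engineering, organized with the support of ACM SIGSOFT and IEEE TCSE. Since 1998, MODELS has covered all aspects of modeling, from languages and methods to tools and applications. Attendees of MODELS come from diverse backgrounds, including researchers, academics, engineers, and industrial professionals. MODELS 2019 is a forum where participants can exchange cutting-edge research results and innovative practical experiences around modeling and model-driven software and systems. This year's edition will give the modeling community an opportunity to further advance the foundations of modeling and to propose innovative applications of modeling in emerging areas such as cyber-physical systems, embedded systems, socio-technical systems, cloud computing, big data, machine learning, security, open source, and sustainability. Official website: http://www.modelsconference.org/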