High data quality is fundamental for today's AI-based systems. However, although data quality has been an object of research for decades, there is a clear lack of research on potential data quality issues (e.g., ambiguous or extraneous values). Such issues are latent in nature and therefore often not obvious. Nevertheless, they can be associated with an increased risk of future problems in AI-based systems (e.g., technical debt, data-induced faults). As a counterpart to code smells in software engineering, we refer to such issues as Data Smells. This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems. In addition, it presents a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, and Consistency Smells). Moreover, the article outlines tool support for detecting data smells and presents the results of an initial smell detection on more than 240 real-world datasets.
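To make the idea of detecting latent issues concrete, the following is a minimal sketch of rule-based checks on a single column of values. It is not the tool support the article describes: the function name `detect_smells`, the smell labels, and the `DUMMY_VALUES` set are hypothetical and chosen only for illustration, and the three heuristics are loosely inspired by the abstract's examples (ambiguous, extraneous values) rather than taken from the catalogue.

```python
import re

# Assumption: placeholder strings that often stand in for "unknown".
# This set is illustrative and not taken from the article.
DUMMY_VALUES = {"n/a", "na", "none", "null", "-", "?", "-999", "9999"}


def detect_smells(values):
    """Map an illustrative smell label to the string values that triggered it."""
    findings = {
        "dummy_value": [],
        "extraneous_whitespace": [],
        "casing_inconsistency": [],
    }
    strings = [v for v in values if isinstance(v, str)]

    for v in strings:
        # Placeholder values are syntactically valid but carry no information.
        if v.strip().lower() in DUMMY_VALUES:
            findings["dummy_value"].append(v)
        # Leading/trailing whitespace or runs of whitespace inside the value.
        if v != v.strip() or re.search(r"\s{2,}", v):
            findings["extraneous_whitespace"].append(v)

    # The same value written with different casing hints at inconsistent entry.
    variants = {}
    for v in strings:
        variants.setdefault(v.strip().lower(), set()).add(v.strip())
    for forms in variants.values():
        if len(forms) > 1:
            findings["casing_inconsistency"].extend(sorted(forms))

    # Report only the labels that actually fired.
    return {name: hits for name, hits in findings.items() if hits}


column = ["Vienna", "vienna", " Graz", "N/A", "Linz", "Linz"]
report = detect_smells(column)
for smell, hits in report.items():
    print(smell, hits)
```

None of the flagged values is necessarily an error, which matches the abstract's point: a smell indicates an increased risk that warrants a closer look, not a confirmed defect.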