Because data is a central component of many modern systems, the cause of a system malfunction may reside in the data and, specifically, in particular properties of the data. For example, a health-monitoring system designed under the assumption that weight is reported in imperial units (pounds) will malfunction when it encounters weight reported in metric units (kilograms). Similar to software debugging, which aims to find bugs in the mechanism (source code or runtime conditions), our goal is to debug the data to identify potential sources of disconnect between the assumptions about the data and the systems that operate on that data. Specifically, we seek to identify which properties of the data cause a data-driven system to malfunction. We propose DataExposer, a framework to identify data properties, called profiles, that are the root causes of performance degradation or failure of a system that operates on the data. Such identification is necessary to repair the system and resolve the disconnect between data and system. Our technique is based on causal reasoning through interventions: when a system malfunctions on a dataset, DataExposer alters the data profiles and observes how the system's behavior changes in response. Unlike statistical observational analysis, which reports mere correlations, DataExposer reports causally verified root causes of the system malfunction, expressed in terms of data profiles. We empirically evaluate DataExposer on three real-world and several synthetic data-driven systems that fail on datasets for a diverse set of reasons. In all cases, DataExposer identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques.
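The intervention idea described above can be illustrated with a toy sketch built on the weight-unit example. This is not DataExposer's actual algorithm or API: the system, the two candidate profiles (`weight_scale`, `name_case`), the threshold `min_drop`, and all function names are hypothetical, and the sketch naively tries every candidate rather than minimizing the number of interventions.

```python
LBS_PER_KG = 2.20462

def toy_system_error(rows):
    """Hypothetical system: flags a record as overweight if weight > 200,
    silently assuming pounds. Returns the fraction of wrong decisions."""
    wrong = sum((r["weight"] > 200) != r["overweight"] for r in rows)
    return wrong / len(rows)

# Each intervention alters the data so that one candidate profile
# matches what the system was (hypothetically) designed for.
def fix_weight_scale(rows):
    # Assumed intervention: convert kilograms to pounds.
    return [{**r, "weight": r["weight"] * LBS_PER_KG} for r in rows]

def fix_name_case(rows):
    # Assumed intervention: normalize name casing (irrelevant to the system).
    return [{**r, "name": r["name"].title()} for r in rows]

CANDIDATE_INTERVENTIONS = {
    "weight_scale": fix_weight_scale,
    "name_case": fix_name_case,
}

def find_root_causes(rows, system_error, interventions, min_drop=0.1):
    """Report profiles whose intervention reduces the malfunction score
    by at least min_drop, together with the observed reduction."""
    baseline = system_error(rows)
    causes = {}
    for profile, intervene in interventions.items():
        drop = baseline - system_error(intervene(rows))
        if drop >= min_drop:
            causes[profile] = drop
    return causes

# Failing dataset: weights are in kilograms, names are upper-case.
failing = [
    {"name": "ANN", "weight": 60.0, "overweight": False},
    {"name": "BOB", "weight": 100.0, "overweight": True},
    {"name": "CAT", "weight": 120.0, "overweight": True},
    {"name": "DAN", "weight": 70.0, "overweight": False},
]

if __name__ == "__main__":
    print(find_root_causes(failing, toy_system_error, CANDIDATE_INTERVENTIONS))
```

In this sketch both profiles differ from what the system expects, so both would correlate with the failure, but only the weight-scale intervention changes the system's behavior, which is the distinction between correlation and a causally verified root cause that the abstract draws.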