Can we infer the sources of errors from the outputs of complex data analytics software? Bidirectional programming promises that we can reverse the flow of software and translate corrections of the output into corrections of either the input or the data analysis. This allows us to achieve the holy grail of automated approaches to debugging, risk reporting, and large-scale distributed error tracking. Since the processing of risk reports and data analysis pipelines can frequently be expressed as a sequence of relational algebra operations, we propose replacing this traditional approach with a data summarization algebra that helps to determine the impact of errors. It works by defining data analysis as a necessarily complete summarization of a dataset, possibly in multiple ways along multiple dimensions. We also present a description that better communicates how complete summarizations of the input data may facilitate easier debugging and more efficient development of analysis pipelines. This approach can also be described as a generalization of axiomatic theories of accounting to data analytics, and is thus dubbed data accounting. We also propose formal properties that allow for transparent assertions about the impact of individual records on the aggregated data and ease debugging by making it possible to find minimal changes that alter the behaviour of the data analysis on a per-record basis.
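The abstract does not spell out the summarization algebra itself, so the following is only an illustrative sketch of the idea of a complete summarization, not the proposed method. It assumes a hypothetical record layout (`record_id`, `region`, `product`, `amount`) and hypothetical helper names (`summarize`, `is_complete`, `impact`). Every record lands in exactly one bucket along each dimension, so the bucket totals must balance against the grand total, much like a ledger, and a suspicious output bucket can be traced back to the records that fed it.

```python
from collections import defaultdict

# Hypothetical input records: (record_id, region, product, amount).
RECORDS = [
    ("r1", "EU", "bonds", 100),
    ("r2", "EU", "equities", 250),
    ("r3", "US", "bonds", 75),
    ("r4", "US", "equities", 300),
]

AMOUNT = 3  # index of the summed field in each record


def summarize(records, key_index):
    """Sum amounts per bucket along one dimension and track contributing records."""
    totals = defaultdict(int)
    members = defaultdict(list)
    for rec in records:
        key = rec[key_index]
        totals[key] += rec[AMOUNT]
        members[key].append(rec[0])
    return dict(totals), dict(members)


def is_complete(records, totals):
    """Check that the bucket totals account for the whole dataset."""
    return sum(totals.values()) == sum(rec[AMOUNT] for rec in records)


def impact(records, record_id, key_index):
    """Return the bucket a record feeds along a dimension, and how much it adds."""
    for rec in records:
        if rec[0] == record_id:
            return rec[key_index], rec[AMOUNT]
    raise KeyError(record_id)


# Summarize the same data along two dimensions.
by_region, region_members = summarize(RECORDS, 1)
by_product, product_members = summarize(RECORDS, 2)
```

In this sketch, `region_members["US"]` lists the records behind the `US` total, which is the backward step from an output value to candidate input errors, and `impact(RECORDS, "r3", 1)` reports the per-record contribution in the forward direction.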