Imperfect databases are common in many applications, for reasons ranging from data-entry errors, transmission or integration errors, and incorrect instrument readings to faulty experimental setups that produce incorrect results. Managing and querying imperfect databases is challenging because it requires incorporating the data's qualities into the database engine. Even more challenging, these qualities are typically not static and may evolve over time. Unfortunately, most state-of-the-art techniques treat data quality as an offline task carried out outside the DBMS, in total isolation from the query processing engine. Hence, end-users receive query results with no indication of whether the results can be trusted for further analysis and decision making. In this paper, we propose the "QTrail-DB" system, which fundamentally extends standard DBMSs to support imperfect databases with evolving qualities. QTrail-DB introduces a new quality model based on the new concept of "Quality Trails", which capture the evolution of the data's qualities over time. QTrail-DB extends the relational data model to incorporate the quality trails within the database system. We propose a new query algebra, called "QTrail Algebra", that enables seamless and transparent propagation and derivation of the data's qualities within a query pipeline. As a result, a query's answer is automatically annotated with quality-related information at the tuple level. QTrail-DB's propagation model leverages the thoroughly studied propagation semantics from the database provenance and lineage-tracking literature, and thus there is no need to develop a new query optimizer. QTrail-DB is implemented within PostgreSQL and experimentally evaluated using real-world datasets to demonstrate its efficiency and practicality.
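To make the idea of tuple-level quality annotation concrete, the following is a minimal Python sketch, not the QTrail-DB implementation or its QTrail Algebra. The names `QualityEvent`, `QTuple`, `select`, and `join` are hypothetical, as are the numeric score and the rule that a join output carries the time-ordered union of its inputs' trails; the abstract does not specify these details.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class QualityEvent:
    """One entry of a quality trail (hypothetical structure)."""
    timestamp: int   # when the quality assessment was recorded
    score: float     # assumed quality score in [0, 1]
    reason: str      # free-text cause of the change


@dataclass
class QTuple:
    """A relational tuple annotated with its quality trail."""
    values: Dict[str, Any]
    trail: List[QualityEvent] = field(default_factory=list)


def select(rows: List[QTuple], predicate: Callable[[Dict[str, Any]], bool]) -> List[QTuple]:
    # Selection filters on data values only; each surviving tuple
    # keeps its trail unchanged.
    return [r for r in rows if predicate(r.values)]


def join(left: List[QTuple], right: List[QTuple], key: str) -> List[QTuple]:
    # Assumed rule: the output tuple carries the union of both input
    # trails, ordered by timestamp (a lineage-style combination).
    out = []
    for l in left:
        for r in right:
            if l.values[key] == r.values[key]:
                merged = sorted(l.trail + r.trail, key=lambda e: e.timestamp)
                out.append(QTuple({**l.values, **r.values}, merged))
    return out


readings = [
    QTuple({"sensor": "s1", "temp": 21.5},
           [QualityEvent(1, 0.9, "inserted"),
            QualityEvent(5, 0.4, "instrument found miscalibrated")]),
    QTuple({"sensor": "s2", "temp": 19.0},
           [QualityEvent(2, 0.8, "inserted")]),
]
sensors = [QTuple({"sensor": "s1", "site": "lab"},
                  [QualityEvent(3, 1.0, "verified")])]

# The query result arrives already annotated with quality information.
result = join(select(readings, lambda v: v["temp"] > 20), sensors, "sensor")
for row in result:
    print(row.values, [(e.timestamp, e.score) for e in row.trail])
```

In this sketch the quality annotations travel with the tuples through the operators, so the caller never has to consult a separate, offline quality report to judge a result.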