Inspired by the visualization of dental plaque at the dentist's office, this article proposes a novel visualization technique for identifying redundancies in relational data. Our approach builds upon an established information-theoretic framework that, despite being well-principled, remains unexplored in practical applications. In this framework, we calculate the information content (or entropy) of each cell in a relation instance, given a set of functional dependencies. The entropy value indicates how readily the cell's value can be inferred from the dependencies and the remaining tuples, so highlighting cells with lower entropy visualizes redundancies in the data. We present an initial prototype implementation and demonstrate that a straightforward approach does not scale to practical problem sizes. To address this limitation, we propose several optimizations, which we prove to be correct. Additionally, we present a Monte Carlo approximation technique with a known error, which makes the computation tractable. Using a real-world dataset of modest size, we illustrate the potential of our visualization technique. Our vision is to support domain experts in data profiling and data cleaning tasks, much as a plaque test does at the dentist's.