Data summarization that presents a small subset of a dataset to users has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., the international classification of Diseases-9, Medical Subject Heading, and Gene Ontology, to name a few. In this paper, we study the problem of selecting a diverse set of k elements to summarize an input dataset with hierarchical terminologies, and visualize the summary in an ontology structure. We propose an efficient greedy algorithm to solve the problem with (1-1/e) = 62% -approximation guarantee. Although this greedy solution achieves quality-guaranteed answers approximately but it is still not optimal. To tackle the problem optimally, we further develop a dynamic programming algorithm to obtain optimal answers for graph visualization of log-data using ontology terminologies called OVDO . The complexity and correctness of OVDO are theoretically analyzed. In addition, we propose a useful optimization technique of tree reduction to remove useless nodes with zero weights and shrink the tree into a smaller one, which ensures the efficiency acceleration of OVDO in many real-world applications. Extensive experimental results on real-world datasets show the effectiveness and efficiency of our proposed approximate and exact algorithms for tree data summarization.
翻译:向用户提供一组小数据集的数据总和,在众多应用和系统中广泛应用了向用户提供一小部分数据集的数据总和。许多数据集都以等级术语编码,例如疾病9-9的国际分类、医学主题标题和基因本体学等。在本文件中,我们研究如何选择一套多样化的 k 元素来总结带有等级术语的输入数据集,并在本体结构中直观地分析该摘要。我们建议一种高效的贪婪算法来解决问题,用(1-1/e)=62% - 匹配保证。虽然这种贪婪的解决方案可以大致实现质量保证的答案,但仍然是不理想的。为了以最佳方式解决问题,我们进一步制定动态的编程算法,以获得最佳的答案,用称为OVDO的术语来对日志数据进行图表直观化。对OVDO的复杂性和正确性进行了理论上的分析。此外,我们建议一种有用的优化方法,用树减少无用的节点来消除零重量的节点,并将树压缩成一个小的树,但仍然是最佳的答案。为了最优化地解决问题,我们所拟的OVAVD- 和最接近的实验性世界数据的效率,以显示许多的实验性世界的加速数据的效率。