Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed as a one-shot process whose purpose is to find the best summary. A useful summary contains k sets that are individually uniform and collectively diverse, so that it is representative. Uniformity addresses interpretability and diversity addresses representativity. Finding such a summary is a difficult task when data is highly diverse and large. We examine the applicability of Exploratory Data Analysis (EDA) to data summarization and formalize Eda4Sum, the problem of guided exploration of data summaries, which seeks to sequentially produce connected summaries with the goal of maximizing their cumulative utility. Eda4Sum generalizes one-shot summarization. We propose to solve it with one of two approaches: (i) Top1Sum, which chooses the most useful summary at each step; (ii) RLSum, which trains a policy with Deep Reinforcement Learning that rewards an agent for finding a diverse and new collection of uniform sets at each step. We compare these approaches with one-shot summarization and top-performing EDA solutions. We run extensive experiments on three large datasets. Our results demonstrate the superiority of our approaches for summarizing very large data, and the need to provide guidance to domain experts.
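To make the notions of uniformity, diversity, and per-step selection more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each set is represented by the values of a single categorical attribute, measures uniformity as the share of the most common value in a set, measures diversity as the fraction of distinct dominant values across sets, and combines the two with a hypothetical weight `alpha`. The function names `uniformity`, `diversity`, `utility`, and `greedy_step` are illustrative; the abstract does not specify the actual measures or the candidate-generation procedure.

```python
from collections import Counter
from typing import List, Sequence

# Assumption: a set is given by the values of one categorical attribute,
# and a summary is a list of k such sets.
Group = Sequence[str]
Summary = Sequence[Group]


def uniformity(group: Group) -> float:
    """Share of items carrying the set's most common value (1.0 = fully uniform)."""
    if not group:
        return 0.0
    return Counter(group).most_common(1)[0][1] / len(group)


def diversity(summary: Summary) -> float:
    """Number of distinct dominant values divided by the number of sets."""
    if not summary:
        return 0.0
    dominant = [Counter(g).most_common(1)[0][0] for g in summary if g]
    return len(set(dominant)) / len(summary)


def utility(summary: Summary, alpha: float = 0.5) -> float:
    """Hypothetical utility: weighted mix of mean uniformity and diversity."""
    if not summary:
        return 0.0
    mean_uniformity = sum(uniformity(g) for g in summary) / len(summary)
    return alpha * mean_uniformity + (1 - alpha) * diversity(summary)


def greedy_step(candidates: List[Summary], alpha: float = 0.5) -> Summary:
    """Pick the candidate summary with the highest utility at this step."""
    return max(candidates, key=lambda s: utility(s, alpha))


# Two candidate summaries with k = 2 sets each.
uniform_and_diverse = [["x", "x"], ["y", "y"]]
mixed = [["x", "y"], ["x", "y"]]
chosen = greedy_step([mixed, uniform_and_diverse])
```

Under these assumptions, the greedy step prefers the candidate whose sets are each homogeneous and differ from one another. A sequential exploration would repeat such a step and accumulate the utilities, which is the quantity the abstract describes as cumulative utility.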