Exploratory data analytics (EDA) is a sequential decision-making process in which an analyst chooses each subsequent query, based on previous queries and their results, in the hope of uncovering interesting insights. To deliver results with low latency, data processing systems often execute these queries on samples of the data. Different downsampling strategies preserve different statistics of the data and yield different magnitudes of latency reduction. The optimal choice of sampling strategy often depends on the particular context of the analysis flow and the hidden intent of the analyst. In this paper, we are the first to consider the impact of sampling in interactive data exploration settings, where the approximation errors it introduces can derail the analysis. We propose a Deep Reinforcement Learning (DRL) based framework that optimizes sample selection so as to keep the analysis and insight-generation flow intact. Evaluations on three real datasets show that, compared to baseline methods, our technique preserves the original insight-generation flow while improving interaction latency.
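To make the framing concrete, below is a minimal sketch (in Python/PyTorch) of how sampler selection could be cast as a DRL problem, as the abstract describes: the agent observes an encoding of the query history and picks a sampling strategy for the next query. This is not the authors' implementation; the action set `SAMPLERS`, the network shape, the epsilon-greedy policy, and the reward weighting `alpha` are all illustrative assumptions.

```python
# Illustrative sketch only: casts per-query sampler selection as a
# discrete-action RL problem. Names and hyperparameters are assumptions,
# not taken from the paper.
import random
import torch
import torch.nn as nn

# Hypothetical set of downsampling strategies the agent chooses among.
SAMPLERS = ["uniform", "stratified", "reservoir"]

class QNet(nn.Module):
    """Maps a fixed-size encoding of the query/result history to one
    Q-value per candidate sampling strategy."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_sampler(qnet: QNet, state: torch.Tensor, eps: float = 0.1) -> int:
    """Epsilon-greedy choice of the sampling strategy for the next query."""
    if random.random() < eps:
        return random.randrange(len(SAMPLERS))
    with torch.no_grad():
        return int(qnet(state).argmax())

def reward(insight_preserved: float, latency_saved: float,
           alpha: float = 0.8) -> float:
    """Assumed reward shape: reward latency reduction only insofar as the
    insight-generation flow stays intact; alpha is an assumed trade-off
    weight, not a value from the paper."""
    return alpha * insight_preserved + (1 - alpha) * latency_saved

# Usage sketch: pick a sampler for the next query given a (hypothetical)
# 32-dimensional encoding of the session so far.
qnet = QNet(state_dim=32, n_actions=len(SAMPLERS))
state = torch.zeros(32)
print(SAMPLERS[select_sampler(qnet, state)])
```

The key design point this sketch highlights is that the reward must couple the two objectives from the abstract: a sampler that minimizes latency but distorts the statistics the analyst relies on would score poorly on the insight-preservation term.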