Histograms, i.e., piece-wise constant approximations, are a popular tool used to represent data distributions. Traditionally, the difference between the histogram and the underlying distribution (i.e., the approximation error) is measured using the $L_p$ norm, which sums the differences between the two functions over all items in the domain. Although useful in many applications, the drawback of this error measure is that it treats approximation errors of all items in the same way, irrespective of whether the mass of an item is important for the downstream application that uses the approximation. As a result, even relatively simple distributions cannot be approximated by succinct histograms without incurring large error. In this paper, we address this issue by adapting the definition of approximation so that only the errors of the items that belong to the support of the distribution are considered. Under this definition, we develop efficient 1-pass and 2-pass streaming algorithms that compute near-optimal histograms in sub-linear space. We also present lower bounds on the space complexity of this problem. Surprisingly, under this notion of error, there is an exponential gap in the space complexity of 1-pass and 2-pass streaming algorithms. Finally, we demonstrate the utility of our algorithms on a collection of real and synthetic data sets.
翻译:直方图, 也就是说, 平方位常数近似, 是一个用来代表数据分布的工具。 传统上, 直方图和基本分布( 近似误) 之间的差别使用 $L_ p$ 标准来测量。 该标准将两个函数在域内所有项目上的差异相加。 虽然在许多应用中有用, 但错误测量的缺点是它以同样的方式处理所有项目的近似差错, 不论一个项目的质量对于使用近似值的下游应用是否重要。 因此, 即使是相对简单的分布也不能在不引起重大错误的情况下通过简洁的直方图( 近似误) 来比较。 在本文中, 我们通过调整近似值定义来解决这个问题, 从而只考虑属于分布支持对象的项目的差错 。 根据这个定义, 我们开发了高效的一通和二通流算法, 在亚线空间中, 也呈现了这个问题的空间复杂性的更低的界限。 令人惊讶的是, 在这个错误的这个概念下, 我们的合成算法系统收集了我们1号的指数级算法 。