Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. Such structures are now central components of large-scale data analytics and machine learning pipelines. However, many common functions, such as thresholds and p-th frequency moments with p > 2, are known to require polynomial-size sketches in the worst case. We explore performance beyond the worst case under two different types of assumptions. The first is having access to noisy advice on item frequencies. This continues the line of work of Hsu et al. (ICLR 2019), who assume predictions are provided by a machine learning model. The second is providing guaranteed performance on a restricted class of input frequency distributions that are better aligned with what is observed in practice. This extends the work on heavy hitters under Zipfian distributions in a seminal paper of Charikar et al. (ICALP 2002). Surprisingly, we show analytically and empirically that "in practice" small polylogarithmic-size sketches provide accuracy for "hard" functions.
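As a quick illustration (not part of the paper itself), the p-th frequency moment mentioned above is defined as F_p = Σ_i f_i^p, where f_i is the frequency of the i-th distinct item. A minimal sketch of the exact (non-streaming) computation, using a toy stream:

```python
from collections import Counter

def frequency_moment(stream, p):
    """Compute F_p = sum over distinct items of (frequency ** p) exactly."""
    freqs = Counter(stream)
    return sum(f ** p for f in freqs.values())

stream = ["a", "a", "a", "b", "b", "c"]
# Frequencies: a=3, b=2, c=1.
# F_1 = 3 + 2 + 1 = 6 (the stream length),
# F_2 = 9 + 4 + 1 = 14,
# F_3 = 27 + 8 + 1 = 36.
print(frequency_moment(stream, 2))  # 14
print(frequency_moment(stream, 3))  # 36
```

Computing F_p exactly requires storing all distinct item counts; the point of the worst-case lower bounds cited above is that for p > 2 any sketch achieving this with small (polylogarithmic) space is impossible on adversarial inputs, which is what motivates the beyond-worst-case assumptions studied in the paper.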