A sketch is a probabilistic data structure used to record frequencies of items in a multi-set. Sketches are widely used in various fields, especially those that involve processing and storing data streams. In streaming applications with high data rates, a sketch "fills up" very quickly. Thus, its contents are periodically transferred to the remote collector, which is responsible for answering queries. In this paper, we propose a new sketch, called Slim-Fat (SF) sketch, which has a significantly higher accuracy compared to prior art, a much smaller memory footprint, and at the same time achieves the same speed as the best prior sketch. The key idea behind our proposed SF-sketch is to maintain two separate sketches: a small sketch called Slim-subsketch and a large sketch called Fat-subsketch. The Slim-subsketch is periodically transferred to the remote collector for answering queries quickly and accurately. The Fat-subsketch, however, is not transferred to the remote collector because it is used only to assist the Slim-subsketch during the insertions and deletions and is not used to answer queries. We implemented and extensively evaluated SF-sketch along with several prior sketches and compared them side by side. Our experimental results show that SF-sketch outperforms the most widely used CM-sketch by up to 33.1 times in terms of accuracy. We have released the source codes of our proposed sketch as well as existing sketches at Github. The short version of this paper will appear in ICDE 2017.
翻译:草图是一种用于记录多组设备中物品频率的概率性数据结构。 在多个领域, 特别是涉及处理和储存数据流的字段中, 切片被广泛使用。 在以高数据率流出的应用中, 草图“ 填满” 很快。 因此, 其内容被定期传送到远程收藏家, 负责回答询问。 在本文中, 我们提议了一个新的草图, 叫做 Slim- Fat (SF) 草图, 其精确度大大高于先前的艺术, 记忆足迹要小得多, 同时达到与先前的最佳草图相同的速度。 我们提议的SF- sk 的关键思想是保持两个不同的草图: 一个叫Slim- subsketch的小草图, 一个叫Fat-subskt。 其内容被定期传送到远程收藏家, 快速和准确地回答问题。 然而, 脂肪子草图没有被转移给远程收藏家, 因为它仅用来帮助插入和删除的Slim sub 。 我们的草图中的现有草图将广泛用来解结果作为我们先前的草图版本。