All data is not equally popular. Often, some portion of data is more frequently accessed than the rest, which causes a skew in popularity of the data items. Adapting to this skew can improve performance, and this topic has been studied extensively in the past for disk-based settings. In this work, we consider an in-memory data structure, namely hash table, and show how one can leverage the skew in popularity for higher performance. Hashing is a low-latency operation, sensitive to the effects of caching, branch prediction, and code complexity among other factors. These factors make learning in-the-loop especially challenging as the overhead of performing any additional operations can be significant. In this paper, we propose VIP hashing, a fully online hash table method, that uses lightweight mechanisms for learning the skew in popularity and adapting the hash table layout. These mechanisms are non-blocking, and their overhead is controlled by sensing changes in the popularity distribution to dynamically switch-on/off the learning mechanism as needed. We tested VIP hashing against a variety of workloads generated by Wiscer, a homegrown hashing measurement tool, and find that it improves performance in the presence of skew (22% increase in fetch operation throughput for a hash table with one million keys under low skew, 77% increase under medium skew) while being robust to insert and delete operations, and changing popularity distribution of keys. We find that VIP hashing reduces the end-to-end execution time of TPC-H query 9, which is the most expensive TPC-H query, by 20% under medium skew.
翻译:并非所有数据都同样受欢迎。 通常, 某些部分的数据访问频率比其它部分更频繁, 这导致数据项目受欢迎程度下降。 适应此片可提高性能, 而这个话题在过去曾为基于磁盘的设置进行了广泛研究。 在这项工作中, 我们考虑一个模拟数据结构, 即 散列表, 并显示如何为更高的性能利用这种受欢迎程度。 散列是一个低延迟操作, 敏感于缓冲、 分支预测和代码复杂性的影响 。 这些因素使得当期学习特别具有挑战性, 因为执行任何额外操作的顶部可能很重要 。 在本文中, 我们提议要使用全线性散列表方法, 使用轻量机制来学习受欢迎程度的斜度, 即散列表。 这些机制是无阻力的, 其管理由感应感应感地显示性能分布的变化, 以动态的开关、 分支预测和代码的复杂性复杂。 我们测试了贵宾在进行一系列的递增量任务中, 在Wisererer公司下, 的递增量运行中, 正在不断测量。