Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive amounts of data that do not always fit on the heap. Therefore, frameworks temporarily move long-lived objects outside the managed heap (off-heap) on a fast storage device. Unfortunately, this practice results in: (1) high serialization/deserialization (S/D) cost, and (2) high memory pressure when off-heap objects are moved back to the managed heap for processing. In this paper, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of the objects in big data frameworks. TeraHeap relies on three concepts. (1) It eliminates S/D cost by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It reduces GC cost by fencing the garbage collector from scanning H2 objects. (3) It offers a simple hint-based interface, which allows frameworks to leverage knowledge about objects for populating H2. We implement TeraHeap in OpenJDK and evaluate it with 15 widely used applications in two real-world big data frameworks, Spark and Giraph. Our evaluation shows that for the same DRAM size, TeraHeap improves performance by up to 73% and 28% compared to native Spark and Giraph, respectively. Also, it provides better performance by consuming up to 8x and 1.2x less DRAM capacity than native Spark and Giraph, respectively. Finally, it outperforms Panthera, a garbage collector for hybrid memories, by up to 69%.
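The three concepts above can be illustrated with a toy model. The sketch below is not TeraHeap's implementation or API. Every name in it (`ToyTwoHeapRuntime`, `hint_move_to_h2`, `collect`) is hypothetical, and the policy of moving a hinted object together with everything reachable from it is an assumption made for illustration. It only shows why fencing the collector from a second heap reduces the number of objects scanned per collection.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based hashing, so objects can live in sets
class Obj:
    """A toy heap object with outgoing references."""
    name: str
    refs: list = field(default_factory=list)
    hinted: bool = False  # set through the hypothetical hint interface


class ToyTwoHeapRuntime:
    """Toy model: H1 is scanned by the collector, H2 is fenced off."""

    def __init__(self):
        self.h1 = set()   # regular managed heap
        self.h2 = set()   # second heap (assumed to sit on fast storage)
        self.scanned = 0  # objects visited by the most recent collection

    def allocate(self, name, refs=()):
        obj = Obj(name, list(refs))
        self.h1.add(obj)
        return obj

    def hint_move_to_h2(self, obj):
        """Hypothetical hint: the framework marks a long-lived object."""
        obj.hinted = True

    def _closure(self, root):
        """All objects reachable from root, including root itself."""
        seen, stack = set(), [root]
        while stack:
            o = stack.pop()
            if o not in seen:
                seen.add(o)
                stack.extend(o.refs)
        return seen

    def collect(self, roots):
        """Mark from roots, sweep H1, then move hinted closures to H2.

        Simplification: H2 objects never point back into H1 here, because
        whole closures are moved. A real system must track such references.
        """
        marked, stack = set(), list(roots)
        while stack:
            o = stack.pop()
            if o in marked or o in self.h2:
                continue  # fence: H2 objects are neither marked nor traversed
            marked.add(o)
            stack.extend(o.refs)
        self.scanned = len(marked)
        self.h1 &= marked  # sweep: drop unreachable H1 objects
        for o in [x for x in self.h1 if x.hinted]:
            group = self._closure(o) & self.h1
            self.h1 -= group
            self.h2 |= group  # objects move as-is, with no serialization step


def demo():
    rt = ToyTwoHeapRuntime()
    partition = rt.allocate("partition")
    cached = rt.allocate("cached_dataset", [partition])
    tmp = rt.allocate("tmp")
    rt.allocate("garbage")  # unreachable, reclaimed by the first collection
    rt.hint_move_to_h2(cached)

    rt.collect([cached, tmp])
    first = rt.scanned   # cached_dataset, partition, and tmp are scanned
    rt.collect([cached, tmp])
    second = rt.scanned  # only tmp is scanned; the cached data is in H2
    return first, second, sorted(o.name for o in rt.h2)
```

Calling `demo()` runs two collections over the same roots. The second one visits fewer objects because the hinted object and its reachable data are in H2 by then, which is the effect the abstract attributes to fencing the collector.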