Freeing Compute Caches from Serialization and Garbage Collection in Managed Big Data Analytics

Managed analytics frameworks (e.g., Spark) cache intermediate results in memory (on-heap) or on storage devices (off-heap) to avoid costly recomputation, especially in graph processing. As datasets grow, on-heap caching requires more memory for long-lived objects, resulting in high garbage collection (GC) overhead. Off-heap caching, on the other hand, moves cached objects to the storage device, which reduces GC overhead but incurs the cost of serialization and deserialization (S/D). In this work, we propose TeraHeap, a novel approach for providing large analytics caches. TeraHeap uses two heaps within the JVM: (1) a garbage-collected heap for ordinary Spark objects and (2) a large heap memory-mapped over fast storage devices for cached objects. TeraHeap eliminates both S/D and GC over cached data without imposing any language restrictions. We implement TeraHeap in Oracle's Java runtime (OpenJDK-1.8). We use five popular, memory-intensive graph analytics workloads to understand S/D and GC overheads and to evaluate TeraHeap. TeraHeap improves total execution time compared to state-of-the-art Apache Spark configurations by up to 72% and 81% for NVMe SSD and non-volatile memory, respectively. Furthermore, TeraHeap requires 8x less DRAM capacity to provide performance comparable to or higher than native Spark. This paper opens up emerging memory and storage devices for practical use in scalable analytics caching.
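
The idea of a heap memory-mapped over a storage device can be illustrated, very loosely, with the standard `java.nio` mapping API. The sketch below is not TeraHeap's implementation, which lives inside the JVM and holds ordinary Java objects; it only shows, under that simplifying assumption, how file-backed data can be read and written through plain loads and stores with no explicit S/D step. The class name `MappedCacheSketch` and the helper `writeAndSum` are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedCacheSketch {

    // Hypothetical helper: places the values in a file-backed mapping,
    // then reads them back through the same mapping and returns their sum.
    static long writeAndSum(Path file, long[] values) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Map the file into the process address space. The OS pages data
            // between DRAM and the device on demand.
            MappedByteBuffer region = channel.map(
                    FileChannel.MapMode.READ_WRITE, 0, (long) values.length * Long.BYTES);
            LongBuffer cached = region.asLongBuffer();

            // Store the values directly; no serialization format is involved.
            cached.put(values);

            // Read them back with absolute gets; no deserialization either.
            long sum = 0;
            for (int i = 0; i < values.length; i++) {
                sum += cached.get(i);
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-cache", ".bin");
        try {
            System.out.println(writeAndSum(file, new long[] {1, 2, 3, 4})); // prints 10
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The mapped region lives outside the garbage-collected heap, so the collector never scans its contents. This mirrors, in a much simpler setting, the motivation for keeping cached data in a separate heap.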
