Operating systems have historically had to manage only a single type of memory device. The imminent availability of heterogeneous memory devices based on emerging memory technologies challenges the classic single-memory model and opens a new spectrum of possibilities for memory management. Transparent data movement between different memory devices, driven by the access patterns of applications, is a desirable feature for making optimal use of such devices and for hiding the complexity of memory management from the end user. However, capturing the memory access patterns of an application at runtime comes at a cost, which is particularly challenging for large-scale parallel applications that may be sensitive to system noise. In this work, we focus on the access pattern profiling phase that precedes the actual memory relocation. We study the feasibility of using Intel's Processor Event-Based Sampling (PEBS) feature to record memory accesses by sampling at runtime, and we study its overhead at scale. We have implemented a custom PEBS driver in the IHK/McKernel lightweight multi-kernel operating system, one of whose advantages is minimal system interference, owing to the lightweight kernel's simple design compared to other OS kernels such as Linux. We present the PEBS overhead for a set of scientific applications and show the access patterns identified in noise-sensitive HPC applications. Our results show that clear access patterns can be captured with 10% overhead in the worst case and 1% in the best case when running on up to 128k CPU cores (2,048 Intel Xeon Phi Knights Landing nodes). We conclude that online memory access profiling using PEBS at large scale is promising for memory management in heterogeneous memory environments.