Accelerators such as GPUs have become a mainstream approach to delivering future performance gains, and sharing a unified virtual memory space between CPUs and GPUs is increasingly adopted to simplify programming. However, address translation, the cornerstone of virtual memory, is becoming a performance bottleneck for GPUs: a single TLB miss can stall hundreds of threads due to the SIMT execution model, degrading performance dramatically. Through real-system analysis, we observe that the OS exhibits substantial contiguity in its allocations (e.g., hundreds of contiguous pages), and that large memory regions with such contiguity become more common as working sets grow. Leveraging this observation, we propose MESC to improve translation efficiency for GPUs. The key idea of MESC is to divide each large page frame (2MB) in the virtual memory space into fixed-size memory subregions (i.e., 64 4KB pages each) and to store the contiguity information of subregions and large page frames in L2PTEs. With MESC, address translations for up to 512 pages can be coalesced into a single TLB entry, without changing the memory allocation policy (i.e., demand paging) and without requiring large-page support. In our experiments, MESC achieves a 77.2% performance improvement and a 76.4% reduction in dynamic translation energy for translation-sensitive workloads.
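The coalescing mechanism described above can be illustrated with a small sketch. This is a toy software model, not the paper's hardware design: the `TLB`, `walk`, and contiguity-dictionary names are illustrative assumptions. It shows how, when the contiguity bits (assumed to be read from the L2PTE during the walk) mark a 2MB frame or a 64-page subregion as contiguous, one TLB entry can serve up to 512 translations.

```python
# Toy model of MESC-style coalesced address translation (a sketch under
# stated assumptions; structure names are hypothetical, not from the paper).

PAGES_PER_SUBREGION = 64
SUBREGIONS_PER_FRAME = 8            # 8 * 64 * 4KB = one 2MB large page frame
PAGES_PER_FRAME = PAGES_PER_SUBREGION * SUBREGIONS_PER_FRAME  # 512 pages

class TLB:
    """Each entry covers a contiguous run of `span` virtual pages with one tag."""
    def __init__(self):
        self.entries = []           # list of (base_vpn, base_pfn, span)
        self.misses = 0

    def lookup(self, vpn):
        for base_vpn, base_pfn, span in self.entries:
            if base_vpn <= vpn < base_vpn + span:
                return base_pfn + (vpn - base_vpn)
        return None

def walk(vpn, page_table, frame_contig, subregion_contig):
    """Page-table walk that also consults the contiguity bits assumed to be
    stored in the L2PTE, returning the widest coalesced mapping available."""
    frame = vpn // PAGES_PER_FRAME
    if frame_contig.get(frame):                   # whole 2MB frame contiguous
        base_vpn = frame * PAGES_PER_FRAME
        return base_vpn, page_table[base_vpn], PAGES_PER_FRAME
    sub = vpn // PAGES_PER_SUBREGION
    if subregion_contig.get(sub):                 # 64-page subregion contiguous
        base_vpn = sub * PAGES_PER_SUBREGION
        return base_vpn, page_table[base_vpn], PAGES_PER_SUBREGION
    return vpn, page_table[vpn], 1                # fall back to a 4KB entry

def translate(vpn, tlb, page_table, frame_contig, subregion_contig):
    pfn = tlb.lookup(vpn)
    if pfn is not None:
        return pfn                                # TLB hit
    tlb.misses += 1
    base_vpn, base_pfn, span = walk(vpn, page_table,
                                    frame_contig, subregion_contig)
    tlb.entries.append((base_vpn, base_pfn, span))
    return base_pfn + (vpn - base_vpn)

# Usage: one fully contiguous 2MB frame -> 512 translations, a single miss.
page_table = {vpn: 1000 + vpn for vpn in range(PAGES_PER_FRAME)}
tlb = TLB()
results = [translate(v, tlb, page_table, {0: True}, {})
           for v in range(PAGES_PER_FRAME)]
print(tlb.misses)        # 1: all 512 pages served by one coalesced entry
```

Because demand paging is untouched, a frame whose pages were not allocated contiguously simply takes the 4KB fallback path; coalescing is opportunistic, driven only by the contiguity the OS already produced.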