The granularity of distributed computing is limited by communication time: there is no point in farming out smaller and smaller tasks if the communication overhead outweighs the reduction in processing time gained from the added parallelism. In this work, we leverage the low communication latency of a new NIC/CPU hardware design, the nanoPU, to explore a new extreme of granularity in distributed computation, where a problem is partitioned into tens of thousands of nanosecond-scale tasks. To evaluate the feasibility and practicality of extremely fine-grained computing, we built NanoSort, a distributed sorting algorithm running on the nanoPU. Using cycle-accurate FireSim simulations of 65,536 nanoPU cores, we show that NanoSort can sort 1M keys in 68 $\mu$s, an order of magnitude faster than MilliSort, the current state of the art.
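The granularity argument can be illustrated with a toy cost model. The sketch below is not taken from the paper: it assumes a perfectly divisible workload and a fixed per-task communication overhead, and the names `work_ns` and `comm_ns` as well as all numeric values are hypothetical, chosen only to show the trend. Under these assumptions the speedup can never exceed `work_ns / comm_ns`, so lowering the communication overhead is what makes finer task sizes worthwhile.

```python
def speedup(work_ns: float, n_tasks: int, comm_ns: float) -> float:
    """Speedup over a single core in a toy model (an assumption, not the paper's model).

    Each of the n_tasks tasks computes for work_ns / n_tasks and then pays a
    fixed communication overhead comm_ns; all tasks run in parallel.
    """
    per_task_ns = work_ns / n_tasks
    return work_ns / (per_task_ns + comm_ns)


def efficiency(work_ns: float, n_tasks: int, comm_ns: float) -> float:
    """Fraction of the ideal n_tasks-fold speedup that is actually achieved."""
    return speedup(work_ns, n_tasks, comm_ns) / n_tasks


# Hypothetical workload: 1 ms of sequential work.
WORK_NS = 1_000_000.0

# Two hypothetical per-task overheads: 10 us (slow) and 100 ns (fast).
for comm_ns in (10_000.0, 100.0):
    for n_tasks in (16, 1_024, 65_536):
        s = speedup(WORK_NS, n_tasks, comm_ns)
        print(f"comm={comm_ns:>8.0f} ns  tasks={n_tasks:>6}  speedup={s:10.1f}")

# If "1M keys" is read as 2**20 (an assumption), each of the 65,536 cores
# starts with only a handful of keys.
keys_per_core = 2**20 // 65_536
print("keys per core:", keys_per_core)
```

In this model the speedup with a 10 µs overhead saturates just below 100 no matter how many tasks are used, while with a 100 ns overhead it keeps growing into the thousands at 65,536 tasks.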