Many recent papers have demonstrated fast in-network computation using programmable switches, running many orders of magnitude faster than CPUs. The main limitations of writing software for switches are the constrained programming model and limited state. In this paper we explore whether a new type of CPU, called the nanoPU, offers a useful middle ground, with a familiar C/C++ programming model, potentially many terabits/second of packet processing on a single chip, and an RPC response time of less than 1 $\mu$s. To evaluate the nanoPU, we prototype and benchmark three common network services on it: packet classification, network telemetry report processing, and consensus protocols. Each service is evaluated using cycle-accurate simulations on FPGAs in AWS. We find that packets are classified 2$\times$ faster and INT reports are processed more than an order of magnitude faster than state-of-the-art approaches. Our production-quality Raft consensus protocol, running on the nanoPU, writes to a 3-way replicated key-value store (MICA) in 3 $\mu$s, twice as fast as the state of the art, with a 99\% tail latency of only 3.26 $\mu$s. To understand how these services can be combined, we study the design and performance of a {\em network reflex plane}, designed to process telemetry data, make fast control decisions, and update consistent, replicated state within a few microseconds.