In response to the "datacenter tax" and "killer microseconds" problems in datacenter applications, diverse solutions have been proposed, including SmartNIC-based ones. Nonetheless, they often suffer from the high overhead of communication over network and/or PCIe links. To address the limitations of current solutions, this paper proposes ORCA, a holistic network and architecture co-design that leverages current RDMA and emerging cache-coherent off-chip interconnect technologies. Specifically, ORCA consists of four hardware and software components: (1) a unified abstraction of inter- and intra-machine communication, managed by one-sided RDMA writes and cache-coherent memory writes; (2) efficient notification of requests to accelerators, assisted by cache coherence; (3) a cache-coherent accelerator architecture that directly processes requests received by the NIC; and (4) adaptive device-to-host data transfer for modern server memory systems consisting of both DRAM and NVM, exploiting state-of-the-art features in CPUs and PCIe. We prototype ORCA on a commercial system and evaluate three popular datacenter applications: an in-memory key-value store, a chain replication-based distributed transaction system, and deep learning recommendation model inference. The evaluation shows that ORCA provides 30.1% to 69.1% lower latency, up to 2.5x higher throughput, and 3x higher power efficiency than current state-of-the-art solutions.
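Components (1) and (2) can be pictured with a toy model. The sketch below is not ORCA's actual design: it is a minimal single-producer, single-consumer request ring, assuming that a remote one-sided RDMA write and a local cache-coherent store both land in the same shared buffer, and that the accelerator learns of new requests by observing an updated tail index rather than by receiving an interrupt or doorbell. The names `RequestRing`, `write`, and `poll` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RequestRing:
    """Toy model of a request ring in memory shared by writers and an accelerator."""
    capacity: int
    slots: List[Optional[bytes]] = field(default_factory=list)
    tail: int = 0  # advanced by the writer once the payload is in place
    head: int = 0  # advanced by the consumer once a request is taken

    def __post_init__(self) -> None:
        # Pre-allocate fixed slots, as a registered memory region would be.
        self.slots = [None] * self.capacity

    def write(self, payload: bytes) -> bool:
        """Stand-in for either a one-sided RDMA write or a local coherent store."""
        if self.tail - self.head == self.capacity:
            return False  # ring is full; the caller has to retry later
        self.slots[self.tail % self.capacity] = payload
        self.tail += 1  # publishing the new tail is the only notification
        return True

    def poll(self) -> Optional[bytes]:
        """Consumer side: detect new work by observing the tail index."""
        if self.head == self.tail:
            return None  # nothing new has been published
        payload = self.slots[self.head % self.capacity]
        self.head += 1
        return payload
```

A short usage example under the same assumptions:

```python
ring = RequestRing(capacity=2)
ring.write(b"GET k1")      # could model a remote client
ring.write(b"PUT k2=v")    # could model a local CPU core
request = ring.poll()      # the accelerator picks up the oldest request
```

In real hardware the ordering between the payload store and the tail update would need explicit memory-ordering guarantees, which this single-threaded model does not attempt to capture.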