In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large-scale data center networks. Network tail latency is often a crucial metric for cloud application performance, and it can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but they are very hard to parallelize, taking hours or days to answer what-if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at the cost of a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this gap by developing a set of techniques to provide fast performance estimates for large-scale networks with general traffic matrices and topologies. A key step is to decompose the problem into a large number of parallel, independent single-link simulations; we carefully combine these link-level simulations to produce accurate estimates of end-to-end flow-level performance distributions for the entire network. Like MimicNet, we exploit symmetry where possible to gain additional speedups, but without relying on machine learning, so there is no training delay. On large-scale networks where ns-3 takes 11 to 27 hours to simulate five seconds of network behavior, our techniques run in one to two minutes, with 99th-percentile flow completion time estimates accurate to within 9%.
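The decomposition described above can be illustrated with a toy sketch. It is not the authors' implementation: the `Flow` record, the FIFO model in `simulate_link`, the 10 Gb/s default capacity, and the rule of summing per-link delays along a path are all assumptions made here for illustration. The abstract says only that link-level results are "carefully combined", so the naive sum stands in for a step it does not specify. The sketch shows the overall shape: split the workload by link, simulate each link independently and in parallel, recombine per flow, and read off a tail percentile.

```python
import math
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record (not from the paper)."""
    flow_id: int
    start: float       # arrival time in seconds
    size_bytes: int
    path: tuple        # ordered names of the links the flow traverses


def split_by_link(flows):
    """Build one independent workload per link."""
    per_link = defaultdict(list)
    for flow in flows:
        for link in flow.path:
            per_link[link].append(flow)
    return per_link


def simulate_link(link_flows, capacity_bps=10e9):
    """Placeholder single-link model: a FIFO queue serving whole flows.

    Returns {flow_id: time from arrival until the flow finishes on this link}.
    A real study would run a packet-level simulator here instead.
    """
    finish = 0.0
    delays = {}
    for flow in sorted(link_flows, key=lambda f: (f.start, f.flow_id)):
        begin = max(finish, flow.start)
        finish = begin + flow.size_bytes * 8 / capacity_bps
        delays[flow.flow_id] = finish - flow.start
    return delays


def estimate_fcts(flows):
    """Simulate every link in parallel, then recombine results per flow."""
    per_link = split_by_link(flows)
    links = list(per_link)
    with ThreadPoolExecutor() as pool:
        link_results = dict(
            zip(links, pool.map(simulate_link, (per_link[l] for l in links)))
        )
    # Assumption: naive combination by summing per-link delays along the path.
    return {
        f.flow_id: sum(link_results[link][f.flow_id] for link in f.path)
        for f in flows
    }


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def relative_error(estimate, reference):
    """Relative error of an estimate against a reference value."""
    return abs(estimate - reference) / reference
```

Because each per-link workload is independent, the `pool.map` call is the only place that needs to change to scale out: a process pool or a cluster scheduler could replace the thread pool for CPU-bound simulations. Comparing `percentile(estimates, 99)` with the same percentile from a full simulation via `relative_error` gives the kind of accuracy figure quoted above.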