Applications that run in large-scale data center networks (DCNs) rely on the DCN's ability to deliver application requests in a performant manner. DCNs expose a complex design and operational space, and network designers and operators care how different options along this space affect application performance. One might run controlled experiments and measure the corresponding application-facing performance, but such experiments become progressively infeasible at a large scale, and simulations risk yielding inaccurate or incomplete results. Instead, we show that we can predict application-facing performance through more easily measured network metrics. For example, network telemetry metrics (e.g., link utilization) can predict application-facing metrics (e.g., transfer latency). Through large-scale measurements of production networks, we study the correlation between the two types of metrics, and construct predictive, interpretable models that serve as a suggestive guideline to network designers and operators. We show that no single network metric is universally the best predictor (even though some prior work has focused on a single predictor). We found that simple linear models often have the lowest error, while queueing-based models are better in a few cases.
翻译:暂无翻译