Cloud networks are difficult to monitor because they grow rapidly and monitoring budgets are limited. We propose a framework for estimating network metrics, such as latency and packet loss, with guarantees on estimation errors for a fixed monitoring budget. Our algorithms, which are based on A- and E-optimal experimental designs in statistics, produce a distribution of probes across network paths, which we then monitor. Unfortunately, these designs are too computationally costly to use at production scale, so we propose scalable and near-optimal approximations of them based on the Frank-Wolfe algorithm. We validate our approaches in simulation on real network topologies and with a production probing system in a real cloud network. Compared to both production and academic baselines, we show major reductions in the probing budget while maintaining low estimation errors, even with very low probing budgets.
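
To make the idea of a Frank-Wolfe approximation to an A-optimal design concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes a hypothetical path matrix `X` whose row i is a feature vector for path i (for example, a 0/1 indicator of the links the path traverses), and it minimizes the standard A-optimality criterion, the trace of the inverse information matrix, over probe distributions `p` on the simplex. The function names, the small `ridge` term, and the step-size schedule are assumptions chosen for illustration; the E-optimal variant is not shown.

```python
import numpy as np


def a_criterion(X, p, ridge=1e-6):
    """A-optimality objective: tr((X^T diag(p) X + ridge * I)^-1)."""
    M = X.T @ (p[:, None] * X) + ridge * np.eye(X.shape[1])
    return float(np.trace(np.linalg.inv(M)))


def a_optimal_frank_wolfe(X, n_iters=500, ridge=1e-6):
    """Approximate an A-optimal probe distribution with Frank-Wolfe.

    X is an (n_paths, n_features) array (hypothetical path features).
    Returns a probability vector p over paths.
    """
    n, d = X.shape
    p = np.full(n, 1.0 / n)  # start from the uniform distribution
    for t in range(n_iters):
        # Information matrix of the current design (ridge keeps it invertible).
        M = X.T @ (p[:, None] * X) + ridge * np.eye(d)
        # Column i of Y is M^-1 x_i.
        Y = np.linalg.solve(M, X.T)
        # Gradient of tr(M^-1) with respect to p_i is -||M^-1 x_i||^2.
        grad = -np.sum(Y * Y, axis=0)
        # Linear minimization over the simplex picks a single path (a vertex).
        i_star = int(np.argmin(grad))
        # Diminishing step size; an assumed schedule, not a tuned one.
        gamma = 2.0 / (t + 3.0)
        p *= 1.0 - gamma
        p[i_star] += gamma
    return p


# Toy usage: 6 candidate paths over 3 links, rows mark the links on each path.
X_toy = np.array(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]],
    dtype=float,
)
p_toy = a_optimal_frank_wolfe(X_toy)
print("probe distribution:", np.round(p_toy, 3))
print("A-criterion:", a_criterion(X_toy, p_toy))
```

Each iteration solves one linear system and moves probability mass toward a single path, which is why this style of update is cheaper per step than solving the full design problem directly. A probing budget of B probes could then be allocated by sampling paths according to `p`.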