Computer systems use many scheduling heuristics to allocate resources. Understanding their performance properties is hard because it requires representative workloads and extensive code instrumentation. As a result, widely deployed schedulers can make poor decisions, leading to unpredictable performance. We propose a methodology that studies their specifications using automated verification tools to search for performance issues over a large set of workloads, system characteristics, and implementation details. Our key insight is that much of the complexity of the system can be overapproximated without oversimplification, allowing system and heuristic developers to quickly and confidently characterize the performance of their designs. We showcase the power of our methodology through four case studies. First, we produce bounds on the performance of two classical algorithms, SRPT scheduling and work stealing, under practical assumptions. Then, we create a model that identifies two bugs in the Linux CFS scheduler. Finally, we verify a recent observation that TCP unfairness can cause some ML training workloads to spontaneously converge to a state of high network utilization.
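As a point of reference for the first case study, the sketch below (not from the paper; the function name and example job list are illustrative) simulates preemptive SRPT (shortest remaining processing time) scheduling on a single server: at every instant the job with the least remaining work runs, and a newly arriving shorter job preempts it.

import heapq

def srpt_completion_times(jobs):
    # jobs: list of (arrival_time, size); returns {job_index: completion_time}
    # under preemptive SRPT on a single server (illustrative sketch only).
    order = sorted(enumerate(jobs), key=lambda j: j[1][0])  # sort by arrival time
    ready, done, t, i = [], {}, 0.0, 0
    while i < len(order) or ready:
        if not ready:                          # server idle: jump to next arrival
            t = max(t, order[i][1][0])
        while i < len(order) and order[i][1][0] <= t:
            jid, (_, size) = order[i]          # admit every job that has arrived
            heapq.heappush(ready, (size, jid))
            i += 1
        rem, jid = heapq.heappop(ready)        # job with least remaining work
        next_arrival = order[i][1][0] if i < len(order) else float("inf")
        run = min(rem, next_arrival - t)       # run until it finishes or the next arrival
        t += run
        if run < rem:
            heapq.heappush(ready, (rem - run, jid))  # preempted: requeue the remainder
        else:
            done[jid] = t
    return done

# Three jobs: a long one at t=0; two short ones arriving later finish first.
print(srpt_completion_times([(0, 10), (1, 2), (2, 1)]))  # {1: 3.0, 2: 4.0, 0: 13.0}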