I/O efficiency is crucial to productivity in scientific computing, but the increasing complexity of HPC systems and applications makes it difficult for practitioners to understand and optimize I/O behavior at scale. Data-driven, machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed. We analyze multiple years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application modeling, poor system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.
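To make the modeling setup concrete, the sketch below trains a simple regression model to predict I/O throughput from log-derived job features. This is a minimal illustration only: the feature names (nprocs, total bytes, file count, sequential-I/O fraction), the synthetic data, and the gradient-boosting model are assumptions for demonstration, not the logs, features, or models used in this work.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for application/scheduler/storage log features (hypothetical).
X = np.column_stack([
    rng.integers(1, 4096, n),       # number of processes
    rng.uniform(1e6, 1e12, n),      # total bytes transferred
    rng.integers(1, 1024, n),       # number of files accessed
    rng.uniform(0.0, 1.0, n),       # fraction of sequential I/O
])
# Synthetic throughput target with noise mimicking contention and system variability.
y = 0.5 * np.log1p(X[:, 1]) * X[:, 3] + rng.normal(0, 2.0, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("Held-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))

The held-out error of such a model is exactly where the failure categories above manifest: contention and noise inflate irreducible error, while poor coverage and modeling choices inflate generalization error.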