Cloud computing is the backbone of the digital society. Digital banking, media, communication, gaming, and many other services depend on the cloud. Unfortunately, cloud services may fail, leading to degraded services, unhappy users, and perhaps millions of dollars lost for companies. Understanding a cloud service failure requires a detailed report on why and how the service failed. Previous work studies how cloud services fail using logs published by cloud operators. However, little is known about how users perceive and experience cloud failures. Therefore, we collect and characterize data on user-reported cloud failures from Down Detector for three cloud service providers over three years. We count the user reports and analyze their time patterns, then derive failures from those reports and characterize their duration and interarrival time. We also characterize provider-reported cloud failures and compare the results with the characterization of user-reported failures. The comparison reveals how users perceive failures and what share of failures cloud service providers report. Overall, this work characterizes both user- and provider-reported cloud failures and compares the two.
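
The abstract does not state how failures are derived from user reports. As an illustration only, the following is a minimal sketch that assumes a simple threshold rule: a failure is a maximal run of consecutive fixed-width time bins whose report count meets a chosen threshold. The function names, the threshold, and the bin width are hypothetical and are not taken from the paper.

```python
from datetime import datetime, timedelta


def derive_failures(counts, threshold, interval):
    """Group consecutive over-threshold bins into failure events.

    counts:    list of (bin_start, report_count), sorted by time and
               assumed contiguous (one entry per bin, no gaps).
    threshold: hypothetical minimum report count that marks a bin as failing.
    interval:  width of each bin as a timedelta.
    Returns a list of (start, end) pairs, one per derived failure.
    """
    failures = []
    start = None      # start of the failure currently being built
    last_hit = None   # most recent bin that met the threshold
    for bin_start, n in counts:
        if n >= threshold:
            if start is None:
                start = bin_start
            last_hit = bin_start
        elif start is not None:
            # The run ended: the failure lasts until the end of its last bin.
            failures.append((start, last_hit + interval))
            start = None
    if start is not None:
        failures.append((start, last_hit + interval))
    return failures


def durations(failures):
    """Duration of each failure (end minus start)."""
    return [end - start for start, end in failures]


def interarrival_times(failures):
    """Time between the starts of consecutive failures."""
    starts = [start for start, _ in failures]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]
```

Under these assumptions, the two summary quantities named in the abstract follow directly from the derived (start, end) pairs. A different rule, for example a threshold relative to each service's baseline report rate, would change which failures are detected but not how duration and interarrival time are computed.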