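Research on a Big-Data-Analytics-Based Architecture for Internet Service Performance Management

Project title: Research on a Big-Data-Analytics-Based Architecture for Internet Service Performance Management

Project number: No.61472214

Project type: General Program

Year of approval: 2015

Discipline: Automation technology, computer technology

Project author: Dan Pei (裴丹)

Affiliation: Tsinghua University

Funding: 860,000 CNY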

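Chinese abstract (translated): As the Internet reaches ever deeper into every aspect of modern life, users demand a better and better Internet service experience. Internet content providers (ICPs) are a crucial part of the Internet ecosystem; the scale, coupling, complexity, and dynamics of their product lines, modules, and software and hardware resources keep growing, while the load from users also changes dynamically. These characteristics pose great challenges for ICP service performance management: operations staff need a complete set of efficient tools to detect and troubleshoot performance degradation events (increased access latency, reduced throughput, access failures, and so on) that can occur at any time. To address these challenges, this project proposes a generic, real-time, high-accuracy architecture for ICP performance experience management based on big data analytics. It comprises four main modules: a stream-based hierarchical KPI aggregation and storage method; an adaptive anomaly detection engine for heterogeneous KPIs; a distributed, noise-tolerant association mining method for massive volumes of operations events; and a distributed fault localization model based on fault propagation chains. The architecture is adaptive, iterative, and self-learning, and can significantly improve the performance experience in Internet service quality management.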

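Chinese keywords (translated): Internet service quality management; big data analytics; anomaly detection; association mining; troubleshooting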

English abstract: As people and businesses rely more and more on the Internet for their daily life and work, they demand an ever better Internet service performance experience. Internet Content Providers (ICPs), among the most important components of the Internet ecosystem, face great challenges in meeting these performance requirements, for two reasons. First, their scale is enormous: the numbers of product lines, shared software modules, and hardware/network resources continue to grow, as do their coupling, complexity, and dynamics. Second, user traffic demand on ICPs can be very dynamic, for reasons such as seasonality and crowd events. As a result of component failures, resource exhaustion, or demand changes, user-perceived performance can degrade in various ways, such as increased latency, reduced throughput, and access failures. Unfortunately, current ICP practices for detecting and troubleshooting these performance degradations are still relatively rudimentary and sometimes manual. This calls for a new architecture that lets ICPs automatically detect, isolate, and troubleshoot all kinds of performance degradation events in real time. To address the above challenges, in this project we propose a generic, real-time, accurate architecture, based on big data analytics, for ICPs to manage user-perceived performance. It consists of four separate but closely related components: 1) a flow-based hierarchical KPI clustering and storage system, which provides fine-grained KPI monitoring with small storage/computation overhead and fast query speed; 2) an adaptive anomaly detection engine that learns each KPI's properties and then chooses the appropriate anomaly detection algorithms and parameters for that KPI; 3) an offline, distributed association rule mining system that automatically calculates the co-occurrence relationships, and even causal relationships, between various fault, anomaly, and workflow events, without requiring 100% accurate input data; 4) a root cause analysis system that takes events and the aforementioned association rules as input, automatically determines the chain of events (including the root cause) that caused the events of interest (such as a performance degradation), and proposes actionable suggestions. The proposed architecture is adaptive, self-learning, and iterative, in that one component's output may serve as input to another, and the latter component's results may provide valuable feedback for the former to learn from and adjust its parameters, achieving better accuracy in the future. All of this feedback is automatic, with minimal manual involvement. We will show through this project that this architecture can greatly improve the user-perceived performance of ICPs and benefit the Internet ecosystem.
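As a rough illustration of component 2 (choosing a detection method per KPI from the KPI's own properties), the following Python sketch picks between two simple detectors based on how periodic a KPI's history is. It is only a minimal sketch under stated assumptions, not the project's actual engine: the function names, the autocorrelation test, the 0.5 threshold, and the k-sigma rule are all hypothetical choices made for this example.

```python
from statistics import mean, stdev


def lag_autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    mu = mean(values)
    denom = sum((v - mu) ** 2 for v in values)
    if denom == 0 or lag >= n:
        return 0.0
    num = sum((values[i] - mu) * (values[i + lag] - mu) for i in range(n - lag))
    return num / denom


def choose_detector(history, period, threshold=0.5):
    """Hypothetical selection rule: strongly periodic KPIs get a seasonal baseline."""
    if lag_autocorrelation(history, period) >= threshold:
        return "seasonal"
    return "global"


def is_anomalous(history, new_value, period, k=3.0):
    """Flag `new_value` if it lies more than k standard deviations from its baseline."""
    if choose_detector(history, period) == "seasonal":
        # Compare only against past points at the same phase of the cycle.
        phase = len(history) % period
        baseline = history[phase::period]
    else:
        # Compare against the whole history.
        baseline = history
    mu, sigma = mean(baseline), stdev(baseline)
    # Floor sigma so a perfectly flat baseline does not give a zero-width band.
    return abs(new_value - mu) > k * max(sigma, 1e-9)


# Assumed toy data: one KPI alternating low/high (period 2), one roughly flat KPI.
seasonal_kpi = [10, 100] * 10
flat_kpi = [50, 52, 49, 51, 50, 48, 53, 50, 51, 49]
```

With this toy data, a value of 100 arriving in a slot where `seasonal_kpi` is normally 10 is flagged, even though 100 is an ordinary value for that KPI overall; a single global threshold would miss it, which is the motivation for selecting the detector per KPI.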

English keywords: Internet service quality management; Big data analytics; Anomaly detection; Association rule mining; Troubleshooting
