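Project title: Research on Probabilistic Modeling of HDFS Read and Write Performance and Model Transfer Methods
Project number: No.61502379
Project type: Young Scientists Fund project
Year of approval: 2016
Discipline: Automation technology, computer technology
Project author: Dong Bo (董博)
Author affiliation: Xi'an Jiaotong University (西安交通大学)
Funding amount: 200,000 CNY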
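Abstract (translated from Chinese): HDFS performance modeling is an active research topic in cloud computing. HDFS performance has a distinctive probability distribution that carries useful information. However, probabilistic modeling of HDFS performance has not yet been studied; moreover, HDFS performance modeling mainly relies on a single modeling method, and research on transferring performance models is still at an early stage. This project therefore studies a probabilistic modeling method for HDFS read and write performance that combines experimental modeling, analytical modeling, and transfer learning. First, based on the working mechanism of HDFS, for file sizes in the range (0, BS] (where BS denotes the block size), experimental modeling is used, and a method for estimating the probability density function of performance based on “characteristic index value prediction and probability density restoration” is proposed; for file sizes in the range (BS, +∞), analytical modeling is used, and a performance modeling method based on probability density superposition is proposed. Second, for probabilistic modeling of HDFS performance on a new platform, a modeling method based on instance transfer is proposed. These methods can build probabilistic models of HDFS read and write performance over the file-size domain for different platforms, reducing modeling cost and improving modeling efficiency, and can serve as a reference for performance modeling of other data-intensive file systems.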
Keywords (translated from Chinese): Hadoop Distributed File System; performance modeling; probabilistic modeling; transfer learning
English abstract: Performance modeling for the Hadoop Distributed File System (HDFS) has attracted significant attention. Through extensive experiments, we have observed that HDFS performance has a unique probability distribution, which contains useful information. However, there is no research on probabilistic modeling of HDFS performance. Moreover, HDFS performance modeling is mainly based on a single modeling method, such as experimental modeling or analytical modeling, each of which has its own weaknesses. Transferring models between different platforms is still a challenge for HDFS performance modeling. Thus, this project aims to propose a probabilistic modeling method for HDFS performance based on “experimental modeling + analytical modeling + transfer learning”. First, for file sizes in the range (0, BS], where BS denotes the block size, an experimental modeling methodology is used. Specifically, a “Probability Density Estimation” method is proposed, which has two steps: (1) predicting characteristic index values and (2) restoring the probability density function. Second, for file sizes in the range (BS, +∞), an analytical modeling methodology is used, and a “Probability Density Superposition”-based performance modeling method is proposed. Finally, for HDFS performance modeling on a new platform, a transfer learning methodology is used, and an “instance-based transfer learning” method is proposed to build approximate probabilistic models. The proposed methods can build probabilistic models of HDFS performance for different platforms, reducing the cost of modeling and improving its efficiency. Moreover, the methods can also be useful for other data-intensive file systems.
English keywords: HDFS; Performance Modeling; Probabilistic Modeling; Transfer Learning
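
Illustrative note: the abstract does not spell out how the “Probability Density Superposition” step works for file sizes in (BS, +∞). The sketch below shows one plausible reading, not the project's actual method. It assumes that a file larger than BS is handled block by block, that the per-block times are independent, and that the density of the total time is therefore the convolution of the per-block densities. The function names, the discretization step `dt`, and the uniform example density are all hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return (number of full blocks, size of the trailing partial block)."""
    return divmod(file_size, block_size)


def convolve(pdf_a, pdf_b, dt):
    """Density of the sum of two independent times.

    Both inputs are densities sampled on a grid that starts at 0 with step dt.
    """
    out = [0.0] * (len(pdf_a) + len(pdf_b) - 1)
    for i, a in enumerate(pdf_a):
        for j, b in enumerate(pdf_b):
            out[i + j] += a * b * dt
    return out


def superposed_density(full_block_pdf, partial_block_pdf, full_blocks, dt):
    """Density of the total time for several full blocks plus an optional partial block.

    ASSUMPTION: block times are independent, so densities combine by convolution.
    Pass partial_block_pdf=None when the file size is an exact multiple of the block size.
    """
    total = None if partial_block_pdf is None else list(partial_block_pdf)
    for _ in range(full_blocks):
        if total is None:
            total = list(full_block_pdf)
        else:
            total = convolve(total, full_block_pdf, dt)
    return total
```

For example, `split_into_blocks(300, 128)` yields 2 full blocks and a 44-unit remainder, so the total-time density would combine two full-block densities with one partial-block density. In this reading, the partial-block density would come from the experimental model for the (0, BS] range.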