Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is used successfully by many Internet companies and has shown its value for big data processing in traditional industries. However, enterprise big data processing systems, such as those in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations such as updates and deletes. Hive cannot offer sufficient support for these operations while preserving high query performance. Hive on the Hadoop Distributed File System (HDFS) cannot implement data manipulation efficiently, while Hive on HBase supports faster data manipulation but suffers from poor query performance. A project based on Hive issue HIVE-5317 aims to support update operations, but it has not been completed in Hive's latest version. Because this ACID-compliant extension adopts the same data storage format on HDFS, it does not solve the update performance problem. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS with the random write capability of HBase. Hive on DualTable provides better data manipulation support while preserving query performance. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations.