Huge amounts of data are generated continuously by digitally interconnected systems of humans, organizations, and machines. The data comes in a variety of formats, including structured, unstructured, and semi-structured, which makes it impossible to apply the same standard approaches, techniques, and algorithms to manage and process all of it. Fortunately, an enterprise-level distributed platform, the Hadoop Ecosystem, exists. This paper explores the Apache Hive component, which provides full-stack data management functionality in terms of Data Definition, Data Manipulation, and Data Processing. Hive is a data warehouse system that works with structured data stored in tables. Since Hive works on top of Hadoop HDFS, it benefits from the distinctive features of HDFS, including Fault Tolerance, Reliability, High Availability, and Scalability. In addition, Hive can take advantage of the distributed computing power of the cluster by assigning jobs to the MapReduce, Tez, and Spark engines to run complex queries. The paper focuses on a study of the Hive Data Model and an analysis of the processing performance achieved by MapReduce and Tez.
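The Data Definition, Data Manipulation, and engine-assignment capabilities described above can be sketched in HiveQL roughly as follows; this is an illustrative fragment only, and the table and column names are assumptions, not taken from the paper:

```sql
-- Hedged sketch: illustrative HiveQL; table/column names are hypothetical.
-- Data Definition: create a managed table stored in the Hive warehouse on HDFS.
CREATE TABLE IF NOT EXISTS sales (
  item_id INT,
  amount  DOUBLE,
  sold_at STRING
)
STORED AS ORC;

-- Select the execution engine for this session: mr (MapReduce), tez, or spark.
SET hive.execution.engine=tez;

-- Data Manipulation / Processing: the query is compiled into jobs for the
-- selected engine and executed across the cluster.
SELECT item_id, SUM(amount) AS total
FROM sales
GROUP BY item_id;
```

Switching `hive.execution.engine` between `mr` and `tez` in this way is one means of comparing the two engines' processing performance on the same query.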