Over the past two decades, data production worldwide has increased exponentially. So-called big data generally come from transactional systems and, even more so, from the Internet of Things and social media. They are mainly characterized by volume, velocity, variety and veracity issues, which strongly challenge traditional data management and analysis systems. The concept of the data lake was introduced to address these issues. A data lake is a large repository of raw data that stores and manages all of a company's data, whatever their format. However, the concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. In this paper, we therefore provide a comprehensive survey of the state of the art in data lake design approaches. We focus in particular on data lake architectures and metadata management, which are key issues in successful data lakes. We also discuss the pros and cons of data lakes and their design alternatives.