Data lakes are becoming increasingly prevalent for big data management and data analytics. In contrast to traditional 'schema-on-write' approaches such as data warehouses, data lakes are repositories that store raw data in its original formats and provide a common access interface. Despite strong interest from both academia and industry, considerable ambiguity remains regarding the definition, functions, and available technologies of data lakes, and a complete, coherent picture of data lake challenges and solutions is still missing. This survey reviews the development, architectures, and systems of data lakes. We provide a comprehensive overview of the research questions that arise in designing and building data lakes. We classify existing approaches and systems by the data lake functions they provide, which makes this survey a useful technical reference for designing, implementing, and deploying data lakes. We hope that the thorough comparison of existing solutions and the discussion of open research challenges in this survey will motivate future development of data lake research and practice.