Although big data has been discussed for years, it still poses many research challenges, especially with respect to the variety of data. Traditional 'schema-on-write' approaches such as data warehouses make it very difficult to efficiently integrate, access, and query large volumes of diverse data held in information silos. Data lakes have been proposed as a solution to this problem. They are repositories that store raw data in its original formats and provide a common access interface. This survey reviews the development, definition, and architectures of data lakes. We provide a comprehensive overview of the research questions that arise in designing and building data lakes. We classify existing data lake systems by the functions they provide, which makes this survey a useful technical reference for designing, implementing, and applying data lakes. We hope that the thorough comparison of existing solutions and the discussion of open research challenges in this survey will motivate the future development of data lake research and practice.