With emerging technologies such as satellites and drones, archaeologists can collect data over large areas. However, processing such data in a timely manner becomes difficult. Archaeological data also come in many different formats (images, texts, sensor data) and can be structured, semi-structured or unstructured. Such variety makes the data difficult to collect, store, manage, search and analyze effectively. A few approaches have been proposed, but none of them covers the full data lifecycle or provides an efficient data management system. Hence, we propose the use of a data lake to provide centralized data stores that host heterogeneous data, as well as tools for data quality checking, cleaning, transformation and analysis. In this paper, we propose a generic, flexible and complete data lake architecture. Our metadata management system exploits goldMEDAL, which is the most complete metadata model currently available. Finally, we detail a concrete implementation of this architecture for an archaeological project.