Data is a precious resource in today's society, and it is generated at an unprecedented and constantly growing pace. The need to store, analyze, and make data promptly available to a multitude of users introduces formidable challenges for modern software platforms. These challenges have radically transformed all research fields that revolve around data management and processing, leading to distributed data-intensive systems that offer new programming models and implementation strategies to handle data characteristics such as its volume, the rate at which it is produced, its heterogeneity, and its distribution. Each data-intensive system makes its own choices in terms of data model, usage assumptions, synchronization, processing strategy, deployment, and guarantees regarding consistency, fault tolerance, and ordering. Yet the problems data-intensive systems face and the solutions they propose frequently overlap. This paper proposes a unifying model that dissects the core functionalities of data-intensive systems and precisely discusses alternative design and implementation strategies, pointing out their assumptions and implications. The model offers a common ground to understand and compare highly heterogeneous solutions, with the potential of fostering cross-fertilization across research communities and advancing the field. We apply our model by classifying tens of systems: an exercise that leads to interesting observations on current trends in the domain of data-intensive systems and suggests open research directions.