Data is a precious resource in today's society, and it is generated at an unprecedented and constantly growing pace. The need to store and analyze data, and to make it promptly available to a multitude of users, poses formidable challenges for modern software platforms. These challenges have radically transformed all research fields that revolve around data management and processing, leading to distributed data-intensive systems that offer new programming models and implementation strategies to handle data characteristics such as its volume, the rate at which it is produced, its heterogeneity, and its distribution. Each data-intensive system makes its own choices in terms of data model, usage assumptions, synchronization, processing strategy, deployment, and guarantees on consistency, fault tolerance, and ordering. Yet the problems data-intensive systems face and the solutions they propose frequently overlap. This paper proposes a unifying model that dissects the core functionalities of data-intensive systems and precisely discusses alternative design and implementation strategies, pointing out their assumptions and implications. The model offers a common ground to understand and compare highly heterogeneous solutions, with the potential of fostering cross-fertilization across research communities and advancing the field. We apply our model by classifying tens of systems: an exercise that yields interesting observations on current trends in the domain of data-intensive systems and suggests open research directions.