Prescriptive Performance Analysis (PPA) has been shown to be more useful than traditional descriptive and diagnostic analyses for making sense of the performance of Big Data (BD) frameworks. In practice, when processing large (RDF) graphs on top of relational BD systems, several design decisions emerge that cannot be made automatically, e.g., the choice of the schema, the partitioning technique, and the storage format. PPA, and ranking functions in particular, helps turn performance data into actionable insights, making it easier for practitioners to choose the best way to deploy BD frameworks, especially for graph processing. However, the amount of experimental work required to implement PPA is still huge. In this paper, we present PAPyA, a library for implementing PPA that (1) prepares RDF graph data for a processing pipeline over relational BD systems; (2) automatically ranks performance in a user-defined solution space of experimental dimensions; and (3) supports flexible user-defined extensions in terms of systems to test and ranking methods. We showcase PAPyA on a set of experiments based on the SparkSQL framework. PAPyA simplifies the performance analytics of BD systems for processing large (RDF) graphs. We provide PAPyA as a public open-source library under an MIT license, which we expect to be a catalyst for designing new prescriptive analytical techniques for BD applications.
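To make the idea of ranking configurations in a solution space concrete, the following is a minimal sketch of one possible rank-based scoring function. It is an illustration under stated assumptions and does not reproduce PAPyA's API: the function name `rank_scores`, the configuration labels, and the runtimes are all hypothetical. Each configuration stands for one point in the solution space (for example, a combination of schema, partitioning technique, and storage format), and its score grows with the number of queries on which it ranks near the top.

```python
from collections import Counter

def rank_scores(runtimes):
    """Score each configuration by how often it ranks well across queries.

    runtimes maps a configuration label to a list of per-query runtimes
    (same query order for every configuration). Returns a dict mapping
    each configuration to a score in [0, 1]; higher is better.
    Assumes at least two configurations.
    """
    configs = list(runtimes)
    d = len(configs)                               # number of configurations
    n_queries = len(next(iter(runtimes.values())))

    # Count how many times each configuration lands at each rank (1 = fastest).
    rank_counts = {c: Counter() for c in configs}
    for q in range(n_queries):
        ordered = sorted(configs, key=lambda c: runtimes[c][q])
        for rank, c in enumerate(ordered, start=1):
            rank_counts[c][rank] += 1

    # Rank r contributes (d - r) points; normalize by the best possible total.
    return {
        c: sum(n * (d - r) for r, n in rank_counts[c].items())
        / (n_queries * (d - 1))
        for c in configs
    }

# Hypothetical runtimes (seconds) for three configurations over three queries.
example_runtimes = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 1.0, 4.0],
    "C": [3.0, 3.0, 5.0],
}
scores = rank_scores(example_runtimes)
best = max(scores, key=scores.get)
```

In this sketch a configuration that is fastest on every query scores 1 and one that is slowest on every query scores 0, so the result can be read as a single prescriptive recommendation rather than a table of raw runtimes.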