Big data frameworks provide data abstractions that let developers process data more efficiently. These abstractions, however, can make data processing code hard to understand and debug. To uncover the challenges developers face when using big data frameworks, we first conduct an empirical study of 1,000 Apache Spark-related questions on Stack Overflow. We find that most of the challenges relate to data transformation and API usage. To address them, we design an approach that helps developers understand and debug data processing in Spark. Our approach uses statistical sampling to minimize performance overhead and provides intermediate information and hint messages for each data processing step in a chained method pipeline. A preliminary evaluation shows that the approach incurs low performance overhead, and the feedback we received from developers was positive.
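To make the idea of sampled, per-step inspection concrete, the following is a minimal sketch in plain Python. It is not the tool described here and does not use Spark: the names `trace_pipeline`, `StepInfo`, and the `(kind, label, function)` step format are hypothetical, and the single hint rule is an assumption chosen for illustration. The sketch runs a chain of map and filter steps on a random sample of the input and records the row count, a short preview, and an optional hint after every step.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Sequence, Tuple


@dataclass
class StepInfo:
    label: str          # human-readable name of the step
    count: int          # number of sampled rows remaining after the step
    preview: List[Any]  # first few rows after the step
    hint: str           # empty when there is nothing to flag


# Hypothetical step format: (kind, label, function), where kind is "map" or "filter".
Step = Tuple[str, str, Callable[[Any], Any]]


def trace_pipeline(records: Iterable[Any], steps: Sequence[Step],
                   sample_fraction: float = 0.1, seed: int = 0,
                   preview_size: int = 3) -> List[StepInfo]:
    """Run the chained steps on a random sample and describe every stage."""
    rng = random.Random(seed)
    # Bernoulli sampling: keep each record independently with probability sample_fraction.
    current = [r for r in records if rng.random() < sample_fraction]
    report = [StepInfo("sample", len(current), current[:preview_size], "")]
    for kind, label, fn in steps:
        had_rows = bool(current)
        if kind == "map":
            current = [fn(r) for r in current]
        elif kind == "filter":
            current = [r for r in current if fn(r)]
        else:
            raise ValueError(f"unsupported step kind: {kind}")
        # Assumed hint rule: flag a step that removes every sampled row.
        hint = "this step removed every sampled row" if had_rows and not current else ""
        report.append(StepInfo(label, len(current), current[:preview_size], hint))
    return report


if __name__ == "__main__":
    demo_steps: List[Step] = [
        ("map", "double", lambda x: x * 2),
        ("filter", "over_1000", lambda x: x > 1000),
    ]
    for info in trace_pipeline(range(100), demo_steps, sample_fraction=0.2):
        print(info)
```

Working on a sample keeps the cost of inspection proportional to the sample size rather than to the full dataset, which is the motivation the abstract gives for sampling. In Spark itself, the sampling step would more plausibly be expressed with the existing `RDD.sample(withReplacement, fraction, seed)` method than with a Python list comprehension.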