During the study, the results of a comparative analysis of large-dataset processing on the Apache Spark platform in the Java, Python, and Scala programming languages were obtained. Although prior works have focused on individual stages, comprehensive comparisons of full ETL workflows across programming languages using Apache Iceberg remain limited. The analysis was performed by executing several operations: reading data from CSV files, transforming it, and loading it into an Apache Iceberg analytical table. It was found that the performance of the Spark pipeline varies significantly with the volume of data and the programming language used. When processing a 5-megabyte CSV file, the best result was achieved in Python: 6.71 seconds, ahead of Scala at 9.13 seconds and Java at 9.62 seconds. For a large 1.6-gigabyte CSV file, all three languages demonstrated similar results: Python was fastest at 46.34 seconds, while Scala and Java finished in 47.72 and 50.56 seconds, respectively. For a more complex operation that combined two CSV files into a single dataset for loading into an Apache Iceberg table, Scala demonstrated the highest performance at 374.42 seconds; Java completed the task in 379.8 seconds, while Python was the least efficient, with a runtime of 398.32 seconds. It follows that the programming language significantly affects the efficiency of data processing with Apache Spark: Scala and Java are more productive for large data volumes and complex operations, while Python demonstrates an advantage when working with small amounts of data. The results obtained can be useful for optimizing data-handling processes depending on specific performance requirements and the volume of information being processed.
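The measured pipeline can be sketched as follows. This is a minimal illustration, not the study's benchmark code: it assumes Spark 3.x with the Iceberg Spark runtime jar on the classpath, and the catalog name (`local`), warehouse path, input file paths, and table name are all hypothetical placeholders. It is written in Scala, one of the three languages compared.

```scala
import org.apache.spark.sql.SparkSession

object CsvToIcebergSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-iceberg-sketch")
      // Register a filesystem-backed ("hadoop") Iceberg catalog named "local";
      // the catalog name and warehouse path are illustrative.
      .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.local.type", "hadoop")
      .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
      .getOrCreate()

    // Extract: read the CSV inputs with a header row and schema inference.
    def readCsv(path: String) =
      spark.read.option("header", "true").option("inferSchema", "true").csv(path)

    val first  = readCsv("/data/part1.csv")  // placeholder paths
    val second = readCsv("/data/part2.csv")

    // Transform: combine the two files into a single dataset, matching columns
    // by name, and drop rows that are entirely null.
    val combined = first.unionByName(second).na.drop("all")

    // Load: create or replace the Iceberg analytical table via the
    // DataFrameWriterV2 API.
    combined.writeTo("local.db.events").using("iceberg").createOrReplace()

    spark.stop()
  }
}
```

The same three stages (extract, transform, load) map directly onto the DataFrame APIs in PySpark and Java, which is what makes the cross-language timing comparison meaningful: the logical plan Spark executes is essentially the same, and the differences come from the language bindings.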