CERN IT provides a set of Hadoop clusters featuring more than 5 PB of raw storage, with different open-source, user-level tools available for analytical purposes. The CMS experiment has been collecting a large set of computing meta-data, e.g., dataset and file access logs, since 2015. These records represent a valuable, yet scarcely investigated, set of information that needs to be cleaned, categorized and analyzed. CMS can use this information to discover useful patterns and enhance the overall efficiency of distributed data placement, improve CPU and site utilization, and reduce task completion time. Here we present an evaluation of the Apache Spark platform for CMS needs. We discuss two main use cases, CMS analytics and ML studies, where efficiently processing billions of records stored on HDFS plays an important role. We demonstrate that both the Scala and Python (PySpark) APIs can be successfully used to execute extremely I/O-intensive queries and provide valuable insight from the collected meta-data.