In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, help us better understand fundamental particles and their interactions. This data is often captured in many small files, creating a data management challenge for scientists. To better facilitate data management, transfer, and analysis on large-scale platforms, it is advantageous to aggregate the data into a smaller number of larger files. However, this aggregation process can consume significant time and resources, and if performed incorrectly, the resulting aggregated files can be inefficient for highly parallel access during analysis on large-scale platforms. In this paper, we present a case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. This case study focuses on detector data from NOvA, a large-scale HEP experiment that generates many terabytes of data. The lessons learned from our case study inform the handling of similar datasets, thus expanding community knowledge related to this common data management task.
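As a minimal illustration of the aggregation idea described above (not the paper's actual HDF5 pipeline), the sketch below merges many small binary files into one larger file while recording per-file offsets, so each member remains individually addressable, which is a prerequisite for parallel access. All file names and the record format here are hypothetical.

```python
import os
import tempfile

def aggregate(small_files, out_path):
    """Concatenate many small files into one large file, recording
    (offset, length) for each so members can be read back individually.
    A toy stand-in for HDF5-style aggregation of small HEP files."""
    index = {}
    with open(out_path, "wb") as out:
        for path in small_files:
            with open(path, "rb") as f:
                data = f.read()
            index[os.path.basename(path)] = (out.tell(), len(data))
            out.write(data)
    return index

def read_member(out_path, index, name):
    """Seek directly to one member's bytes without scanning the file."""
    offset, length = index[name]
    with open(out_path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Demo: write a few small "event" files, aggregate them, read one back.
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, f"event_{i}.dat")
    with open(p, "wb") as f:
        f.write(bytes([i]) * 10)
    paths.append(p)

agg = os.path.join(tmp, "aggregated.dat")
idx = aggregate(paths, agg)
```

In a real HDF5 workflow, the library's chunked datasets and per-chunk compression filters play the role of this explicit index, and collective parallel I/O replaces the serial loop.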