The experimental neutron science facilities at Oak Ridge National Laboratory (ORNL) produce 1.2\,TB of raw event-based data per day, stored using the standard metadata-rich NeXus schema built on top of the HDF5 file format. The performance of several data reduction workflows is largely determined by the time spent in the loading and processing algorithms of Mantid, an open-source data analysis framework used at several neutron science facilities around the world. This work introduces new data management algorithms to address identified input/output (I/O) bottlenecks in Mantid. First, we introduce an in-memory binary-tree metadata index that resembles NeXus data access patterns to provide a scalable search and extraction mechanism. Second, data encapsulation in Mantid algorithms is redesigned to reduce the total compute and memory footprint at runtime associated with metadata I/O reconstruction tasks. Results show wall-clock speedups of 11\% to 30\% on ORNL data reduction workflows, depending on the complexity of the targeted instrument-specific data. Nevertheless, we highlight the need for more research to address reduction challenges as experimental data volumes increase.
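To make the first contribution concrete, the following is a minimal sketch of an in-memory binary-tree metadata index, written under stated assumptions rather than taken from Mantid. The abstract does not specify the key or node layout, so this sketch assumes entries are keyed by absolute HDF5 path and carry an `NX_class` label. The names `Node`, `MetadataIndex`, `insert`, `find`, and `with_prefix`, as well as the example paths, are hypothetical. The tree is left unbalanced for brevity, whereas a production index would likely use a balanced structure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    path: str                        # absolute HDF5 path (assumed key)
    nx_class: str                    # NX_class label stored with the entry
    left: Optional["Node"] = None
    right: Optional["Node"] = None


class MetadataIndex:
    """Unbalanced binary search tree keyed by path (illustrative only)."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, path: str, nx_class: str) -> None:
        """Add an entry, or overwrite the class of an existing path."""
        if self.root is None:
            self.root = Node(path, nx_class)
            return
        cur = self.root
        while True:
            if path == cur.path:
                cur.nx_class = nx_class
                return
            if path < cur.path:
                if cur.left is None:
                    cur.left = Node(path, nx_class)
                    return
                cur = cur.left
            else:
                if cur.right is None:
                    cur.right = Node(path, nx_class)
                    return
                cur = cur.right

    def find(self, path: str) -> Optional[str]:
        """Return the NX_class for an exact path, or None if absent."""
        cur = self.root
        while cur is not None:
            if path == cur.path:
                return cur.nx_class
            cur = cur.left if path < cur.path else cur.right
        return None

    def with_prefix(self, prefix: str) -> List[str]:
        """Return, in sorted order, every indexed path that starts with prefix."""
        out: List[str] = []

        def visit(node: Optional[Node]) -> None:
            if node is None:
                return
            if node.path < prefix:
                visit(node.right)    # matches can only lie to the right
            elif node.path.startswith(prefix):
                visit(node.left)
                out.append(node.path)
                visit(node.right)
            else:
                visit(node.left)     # node sorts after every possible match

        visit(self.root)
        return out


# Hypothetical entries, as might be collected in one pass over a file.
index = MetadataIndex()
for p, c in [
    ("/entry/instrument", "NXinstrument"),
    ("/entry", "NXentry"),
    ("/entry/bank1_events", "NXevent_data"),
    ("/entry/bank1_events/event_id", "dataset"),
    ("/entry/sample", "NXsample"),
]:
    index.insert(p, c)
```

Once such an index is built, a query such as `index.with_prefix("/entry/bank1_events")` is answered from memory, without reopening or re-traversing the file, which is the kind of repeated metadata lookup the abstract identifies as a bottleneck.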