Scientific datasets are known for their challenging storage demands and for the processing pipelines that transform their information. Such processing tasks include filtering, cleansing, aggregation, normalization, and data format translation, all of which generate even more data. In this paper, we present an infrastructure for the HDF5 file format that enables dataset values to be populated on the fly: task-related scripts can be attached to HDF5 files and execute only when the dataset is read by an application. We provide details on the software architecture that supports user-defined functions (UDFs) and describe how it integrates with hardware accelerators and computational storage. Moreover, we describe the built-in security model that limits the system resources a UDF can access. Finally, we present several use cases that show how UDFs can extend scientific datasets in ways that go beyond the original scope of this work.
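To make the mechanism concrete, the following is a minimal sketch of what a read-time UDF might look like in a Python backend. The dataset names, the dynamic_dataset entry point, and the lib.getData/lib.getDims helpers are assumptions made for illustration; they are not necessarily the interface presented in this paper.

    # Hypothetical UDF script attached to an HDF5 file. The runtime is
    # assumed to inject a helper object named `lib`; the entry point name
    # `dynamic_dataset` is likewise an assumption for illustration.

    def dynamic_dataset():
        # Output buffer of the virtual dataset being populated on the fly.
        out = lib.getData("temperature_celsius")
        # Input buffer backed by a dataset actually stored in the file.
        raw = lib.getData("temperature_kelvin")
        dims = lib.getDims("temperature_celsius")

        # Example task: unit conversion performed at read time, so the
        # derived values never occupy space on disk.
        for i in range(dims[0]):
            out[i] = raw[i] - 273.15

Under these assumptions, an application would read the virtual dataset through the regular HDF5 API (e.g., f["temperature_celsius"][:] in h5py), and the attached script would execute transparently inside the I/O path, returning the computed values as if they had been stored.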