In general, large datasets enable deep learning models to achieve high accuracy and strong generalizability. However, massive high-fidelity simulation datasets (from molecular chemistry, astrophysics, computational fluid dynamics (CFD), etc.) can be challenging to curate due to dimensionality and storage constraints. Lossy compression algorithms can help mitigate storage limitations, as long as the overall data fidelity is preserved. To illustrate this point, we demonstrate that deep learning models, trained and tested on data from a petascale CFD simulation, are robust to errors introduced during lossy compression in a semantic segmentation problem. Our results demonstrate that lossy compression algorithms offer a realistic pathway for exposing high-fidelity scientific data to open-source data repositories for building community datasets. In this paper, we outline, construct, and evaluate the requirements for establishing a big data framework, demonstrated at https://blastnet.github.io/, for scientific machine learning.
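As a minimal, hypothetical illustration of the trade-off described above (the paper's actual compression codec and error metrics may differ), the sketch below downcasts a synthetic 3D scalar field to half precision, one simple form of lossy compression, and reports the resulting storage savings alongside the relative reconstruction error that a downstream model would have to tolerate.

```python
import numpy as np

# Synthetic stand-in for one scalar field from a 3D CFD snapshot
# (the real data would come from a high-fidelity simulation).
rng = np.random.default_rng(0)
field = rng.standard_normal((128, 128, 128)).astype(np.float64)

# Lossy step: downcast to half precision before archiving.
compressed = field.astype(np.float16)

# Reconstruct and quantify the error introduced by compression.
reconstructed = compressed.astype(np.float64)
rel_error = np.linalg.norm(reconstructed - field) / np.linalg.norm(field)
ratio = field.nbytes / compressed.nbytes

print(f"compression ratio: {ratio:.1f}x, relative L2 error: {rel_error:.2e}")
```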