The recent influx of open scientific data has contributed to a shift in scientific computing from compute-intensive to data-intensive workloads. Although many Big Data frameworks minimize the cost of data transfers, few scientific applications integrate these frameworks or adopt data-placement strategies to mitigate those costs. Scientific applications commonly rely on well-established command-line tools that would require complete reinstrumentation to incorporate existing frameworks. We developed Sea to enable data-placement strategies for scientific applications executing on HPC clusters without the need to reinstrument workflows. Sea uses GNU C library interception to capture the POSIX-compliant file system calls made by the applications. We designed a performance model and evaluated Sea on a synthetic data-intensive application processing a representative neuroimaging dataset (the Big Brain). Our results demonstrate that Sea significantly improves performance, by up to 3$\times$.